Tutorial: Rerank search results using the ColBERT Reranker#

In this tutorial, we will learn how to use a neural reranker to rerank the results of a BM25 search. The reranker is based on the ColBERT algorithm described in Khattab and Zaharia, “ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT” (SIGIR 2020).

To keep this tutorial easy to follow, we demonstrate the steps on a very small document collection. The same technique scales to millions of documents: we have tested it on up to 21 million Wikipedia passages!

The tutorial will take you through these three steps:

  1. Build a BM25 index over a small sample collection

  2. Query the BM25 index to obtain initial search results

  3. Rerank the initial results with a neural reranker to obtain the final search results

Step 0: Prepare a Colab Environment to run this tutorial on GPUs#

Make sure to “Enable GPU Runtime” in Colab (Runtime → Change runtime type → Hardware accelerator → GPU). This makes the tutorial run considerably faster.

Step 1: Install PrimeQA#

First, we install the required packages. Java 11 is needed by the BM25 indexing components.

[ ]:
!pip install --upgrade pip

# Java 11 is required
!pip install install-jdk gdown

import jdk
import os
if not os.path.exists("/tmp/primeqa-jdk/jdk-11.0.19+7/"):
    jdk_dir = jdk.install('11', path="/tmp/primeqa-jdk")

# set the JAVA_HOME environment variable to point to Java 11
%env JAVA_HOME=/tmp/primeqa-jdk/jdk-11.0.19+7/

# install primeqa
!pip install --upgrade primeqa

Next we set up some paths. Please update the output_dir path to a location where you have write permissions.

[ ]:
# Setup paths
output_dir = "/tmp/primeqa-tutorial"

import os
# create output directory if it does not exist
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# setup some paths
downloaded_corpus_file = os.path.join(output_dir,"sample-document-store.csv")
collection_file = os.path.join(output_dir,"sample_collection.tsv")
reranker_model_path = os.path.join(output_dir, "DrDecr.dnn")

Step 2: Download the sample corpus and the ColBERT DrDecr model#

[ ]:
# download the sample collection
! gdown  --id 1LULJRPgN_hfuI2kG-wH4FUwXCCdDh9zh --output {output_dir}/

# Download the reranker model
! wget -P {output_dir} https://huggingface.co/PrimeQA/DrDecr_XOR-TyDi_whitebox/resolve/main/DrDecr.dnn

Step 3: Pre-process your document collection#

[ ]:
# Preprocess the document collection
from primeqa.ir.util.corpus_reader import DocumentCollection

collection = DocumentCollection(downloaded_corpus_file)
collection.write_corpus_tsv(collection_file)

! head -2 {collection_file}
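If you want to experiment without the downloaded corpus, a tiny collection TSV can also be written by hand. This is a sketch only: the three-column id / text / title layout is an assumption, so compare it against the `head` output above before using it.

```python
import csv
import os
import tempfile

# Two tiny hypothetical passages in (id, text, title) form -- the column
# order is an assumption; verify it against your generated sample_collection.tsv.
rows = [
    ("0",
     "Albert Einstein received the 1921 Nobel Prize in Physics for his "
     "discovery of the law of the photoelectric effect.",
     "Albert Einstein"),
    ("1",
     "The Nobel Prizes are awarded annually in Stockholm and Oslo.",
     "Nobel Prize"),
]

tiny_collection = os.path.join(tempfile.gettempdir(), "tiny_collection.tsv")
with open(tiny_collection, "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    writer.writerow(("id", "text", "title"))  # header row
    writer.writerows(rows)
```

The resulting file can then be passed to the indexer in place of the full collection if you only want to smoke-test the pipeline.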

Step 4: Build a BM25 index with the PrimeQA BM25 Indexer#

[ ]:
from primeqa.components.indexer.sparse import BM25Indexer

# Instantiate and configure the indexer
indexer = BM25Indexer(index_root=output_dir, index_name="sample_index_bm25")
indexer.load()   # IMPORTANT: required to configure

# Index the collection
indexer.index(collection=collection_file, overwrite=True)

Step 5: Start asking Questions#

We’re now ready to query the index we created.

Each search hit is a tuple consisting of (document_id, score).

[ ]:
from primeqa.components.retriever.sparse import BM25Retriever
import json

# Example questions
question = ["Why was Einstein awarded the Nobel Prize?"]

# Instantiate and configure the retriever
retriever = BM25Retriever(index_root=output_dir, index_name="sample_index_bm25", collection=None)
retriever.load()

# Search
hits = retriever.predict(question, max_num_documents=5)
print(json.dumps(hits,indent=2))
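Because each hit is a plain (document_id, score) tuple, the result list for a query can be post-processed with ordinary Python. The hits below are made-up values for illustration, not real retriever output:

```python
# Hypothetical BM25-style hits for one query: (document_id, score) tuples
example_hits = [("17", 12.3), ("4", 15.8), ("42", 9.1)]

# Sort by descending score and keep just the document ids
ranked_ids = [doc_id for doc_id, _score in
              sorted(example_hits, key=lambda h: h[1], reverse=True)]
```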

Step 6: Rerank the BM25 search results with a Neural Reranker#

We will be using the DrDecr model, trained on Natural Questions and XOR-TyDi. This model has obtained state-of-the-art results on the XOR-TyDi retrieval task.

Here are the steps we will take:

1. Fetch the documents corresponding to the BM25 search hits
2. Initialize the PrimeQA ColBERTReranker
3. Rerank the BM25 search results

The reranker encodes the question and passage texts using the Reranker model and uses the representations to compute a similarity score.
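To make the similarity score concrete: ColBERT-style late interaction compares every query-token embedding with every passage-token embedding and, for each query token, keeps the best-matching passage token (the MaxSim operator), then sums those maxima. The sketch below reproduces that arithmetic with toy random vectors; it is purely illustrative and does not use the actual DrDecr embeddings.

```python
import numpy as np

def maxsim_score(q_emb, d_emb):
    """ColBERT-style late interaction over L2-normalized token embeddings.

    q_emb: (num_query_tokens, dim), d_emb: (num_doc_tokens, dim)
    """
    sim = q_emb @ d_emb.T                # all pairwise cosine similarities
    return float(sim.max(axis=1).sum())  # best doc token per query token, summed

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
query = normalize(rng.normal(size=(4, 8)))                    # 4 query tokens, dim 8
relevant = normalize(query + 0.05 * rng.normal(size=(4, 8)))  # tokens near the query's
unrelated = normalize(rng.normal(size=(6, 8)))                # random, unrelated tokens

scores = {"relevant": maxsim_score(query, relevant),
          "unrelated": maxsim_score(query, unrelated)}
```

A passage whose token embeddings line up with the query's scores close to the maximum of 4.0 (one per query token), while an unrelated passage scores noticeably lower; the reranker orders passages by this score.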

We will use the DocumentCollection instance to fetch the documents corresponding to the BM25 search results.

[ ]:
# Fetch documents
hits_to_be_reranked = collection.add_document_text_to_hit(hits[0])

print(json.dumps(hits_to_be_reranked,indent=2))

Step 7: Run the Reranker#

Next, we will initialize the ColBERT reranker with the DrDecr model and rerank the BM25 search results.

[ ]:
# Import the ColBERT Reranker
from primeqa.components.reranker.colbert_reranker import ColBERTReranker

# Instantiate the ColBERTReranker
reranker = ColBERTReranker(reranker_model_path)
reranker.load()

# rerank the BM25 search result and output the top 3 hits
reranked_results = reranker.rerank(question, [hits_to_be_reranked], max_num_documents=3)
print(json.dumps(reranked_results,indent=2))

Step 8: Print the top ranked result before and after reranking#

[ ]:
# print top ranked results
print(f'QUESTION: {question}')

print("\n========Top search result BEFORE reranking")
print(hits_to_be_reranked[0]['document']['title'],  hits_to_be_reranked[0]['document']['text'])

print("\n========Top search result AFTER reranking")
print(reranked_results[0][0]['document']['title'],  reranked_results[0][0]['document']['text'])

Congratulations! 🎉✨🎊🥳#

You have successfully completed the Reranker tutorial!