Tutorial: Retrieval of documents from a corpus using Neural Information Retrieval (IR)#

In this tutorial you’ll learn how to use a popular Neural IR system called DPR [Karpukhin2020].

Step 0: Prepare a Colab Environment to run this tutorial on GPUs#

Make sure to “Enable GPU Runtime” in Colab (Runtime → Change runtime type → select a GPU hardware accelerator). This step ensures the tutorial runs faster.

Step 1: Install PrimeQA#

First, we install the PrimeQA package.

[ ]:
! pip install --upgrade primeqa

Step 2: Pre-process your document collection so it is ready to be stored in your Neural Search Index.#

In this step we download a publicly available .csv file from a Google Drive location and save it as .tsv.

[ ]:
# Download a publicly shared CSV from Google Drive and save it as a
# tab-separated file in the format expected by DPR (id, text, title).
import pandas as pd
url = 'https://drive.google.com/file/d/1LULJRPgN_hfuI2kG-wH4FUwXCCdDh9zh/view?usp=sharing'
# Convert the share link into a direct-download link by extracting the file id.
url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]
df = pd.read_csv(url)
# Write the row index as the id column, followed by the text and title columns.
df.to_csv('input.tsv', sep='\t', columns=['text', 'title'], index_label='id')

Step 3: Init – Initialize your model. In PrimeQA, searching through your corpus is done with a class called SearchableCorpus.#

For DPR, you need to point to question and context encoder models available via the Hugging Face model hub.

[ ]:
from primeqa.components import SearchableCorpus
collection = SearchableCorpus(context_encoder_name_or_path="PrimeQA/XOR-TyDi_monolingual_DPR_ctx_encoder",
                              query_encoder_name_or_path="PrimeQA/XOR-TyDi_monolingual_DPR_qry_encoder",
                              batch_size=64, top_k=10)

Step 4: Add your documents into the searchable corpus.#

The input.tsv file can now be added to the searchable corpus. DPR expects each line to follow this tab-separated format:

id \t text \t title_of_document

Note: since DPR is based on an encoder language model, its maximum sequence length is typically 512 sub-word tokens. Make sure your documents are split into passages of roughly 220 words.

[ ]:
collection.add_documents('input.tsv')
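If your source documents are longer than that, they need to be chunked before building input.tsv. A minimal sketch of such a splitter (plain Python with whitespace tokenization; `split_into_passages` is a hypothetical helper, not part of PrimeQA):

```python
# Hypothetical helper (not part of PrimeQA): split a long document into
# passages of at most `max_words` whitespace-separated words.
def split_into_passages(text, max_words=220):
    words = text.split()
    return [' '.join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

# Example: a 500-word document becomes three passages (220 + 220 + 60 words).
doc = ' '.join(['word'] * 500)
passages = split_into_passages(doc)
print([len(p.split()) for p in passages])  # [220, 220, 60]
```

Each resulting passage would then become one row of input.tsv, with the original document's title repeated for every chunk.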

Step 5: Search – start asking questions.#

Your queries can be passed as a list. You can also retrieve the ids of the retrieved documents along with their passages.

[ ]:
import json
queries = ['When was Idaho split in two?', 'Who was Danny Nozel']
# search returns, for each query, the ids of the top documents and their passages
retrieved_doc_ids, passages = collection.search(queries)
print(json.dumps(passages, indent=4))
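Assuming `search` returns one list of retrieved passages per query (a post-processing sketch with stand-in data, not the exact PrimeQA return type), you can pair each question with its results:

```python
# Stand-in data: `passages` is assumed to hold one list of retrieved
# passage texts per query, in the same order as `queries`.
queries = ['When was Idaho split in two?', 'Who was Danny Nozel']
passages = [['passage A1', 'passage A2'], ['passage B1']]

# Print each query followed by its ranked passages.
for query, hits in zip(queries, passages):
    print(query)
    for rank, passage in enumerate(hits, start=1):
        print(f'  {rank}. {passage}')
```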

Congratulations 🎉✨🎊🥳!! You can now index and search documents with a popular Neural IR technique.