primeqa.ir.util.corpus_reader.DocumentCollection

class primeqa.ir.util.corpus_reader.DocumentCollection(input_files: Union[str, bytes, os.PathLike], fieldnames=None)

Bases: object

Methods

add_document_text_to_hit

Look up and add document text/title to the hits

load_corpus

Load the corpus from TSV/CSV or JSON input

write_corpus_tsv

Write out the corpus in a format ready for indexing.

add_document_text_to_hit(hits: list)

Look up and add document text/title to the hits

Parameters

hits – list of (document_id, score) tuples

Returns

list of dict {'document': document_dict, 'score': score}

Return type

list[dict]
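
The input and output shapes described above can be illustrated with a minimal, standalone sketch. It does not call PrimeQA: the `attach_documents` helper, the in-memory `corpus` dictionary, and the field names inside each document dict are assumptions made for illustration only, and the actual method may structure its documents differently.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory corpus keyed by document id (an assumption; the real
# class builds its own lookup from the input files).
corpus: Dict[str, Dict[str, str]] = {
    "doc1": {"document_id": "doc1", "title": "Paris", "text": "Paris is the capital of France."},
    "doc2": {"document_id": "doc2", "title": "Berlin", "text": "Berlin is the capital of Germany."},
}


def attach_documents(
    hits: List[Tuple[str, float]], corpus: Dict[str, Dict[str, str]]
) -> List[dict]:
    """Turn (document_id, score) tuples into {'document': ..., 'score': ...} dicts."""
    return [{"document": corpus[doc_id], "score": score} for doc_id, score in hits]


# Hits as a retriever might return them: (document_id, score) tuples.
hits = [("doc2", 12.5), ("doc1", 9.1)]
results = attach_documents(hits, corpus)
```

Each element of `results` pairs the looked-up document dict with the score from the corresponding hit, in the same order as the input list.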

load_corpus()

Load the corpus from TSV/CSV or JSON input

write_corpus_tsv(output_file: str)

Write out the corpus in a format ready for indexing.

Parameters

output_file (str) – tsv file where each row is in the format 'id\ttext\ttitle' (tab-separated id, text, and title)
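
As a rough illustration of the row layout, the sketch below writes tab-separated id, text, and title columns with Python's standard `csv` module. It is not the PrimeQA implementation: the `write_rows_as_tsv` helper and the sample documents are hypothetical, and whether the real output includes a header row is not stated here, so none is written.

```python
import csv
import io
from typing import Dict, Iterable, TextIO


def write_rows_as_tsv(docs: Iterable[Dict[str, str]], out: TextIO) -> None:
    """Write one tab-separated row per document: id, text, title."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for doc in docs:
        writer.writerow([doc["id"], doc["text"], doc["title"]])


# Hypothetical sample documents, used only to show the column order.
docs = [
    {"id": "1", "text": "Paris is the capital of France.", "title": "Paris"},
    {"id": "2", "text": "Berlin is the capital of Germany.", "title": "Berlin"},
]

# An in-memory buffer stands in for the output file in this sketch.
buffer = io.StringIO()
write_rows_as_tsv(docs, buffer)
tsv_text = buffer.getvalue()
```

To write to disk instead, the same helper could be passed a file opened with `open(path, "w", newline="", encoding="utf-8")`.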