primeqa.mrc.processors.preprocessors.natural_questions.NaturalQuestionsPreProcessor#

class primeqa.mrc.processors.preprocessors.natural_questions.NaturalQuestionsPreProcessor(*args, **kwargs)#

Bases: primeqa.mrc.processors.preprocessors.base.BasePreProcessor

Preprocessor for NQ data. Note this preprocessor only supports single_context_multiple_passages=True and will override the value accordingly.
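The override mentioned above can be pictured with a minimal sketch; the class and constructor below are a hypothetical stand-in, not the actual PrimeQA signature:

```python
class NQPreProcessorSketch:
    """Hypothetical stand-in illustrating the flag override."""

    def __init__(self, single_context_multiple_passages: bool = False, **kwargs):
        # NQ provides one document per example, split into passage
        # candidates, so the preprocessor forces this flag to True
        # regardless of the value the caller passed.
        self.single_context_multiple_passages = True
```

Any `single_context_multiple_passages=False` argument is silently replaced, which is why the docstring calls the override out explicitly.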

Methods

adapt_dataset

Process dataset examples to rename fields, create context and set answer offset.

get_annotations

Process NQ annotations into preprocessor format.

label_features_for_subsampling

Annotate each training feature with a 'subsample_type' of type SubsampleType for subsampling.

process_eval

Process eval examples into features.

process_train

Process training examples into features.

subsample_features

Subsample training features according to 'subsample_type'.

validate_schema

Validate that the data schema is correct for this preprocessor.

adapt_dataset(dataset: datasets.arrow_dataset.Dataset, is_train: bool, keep_html: bool = True) datasets.arrow_dataset.Dataset#

Process dataset examples to rename fields, create context and set answer offset.

Parameters
  • dataset – dataset to be processed.

  • is_train – True for training, otherwise False.

  • keep_html – True to keep HTML tokens in the context, otherwise False.

Returns

Processed dataset.
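The interaction between keep_html and answer offsets can be sketched as follows; this is a simplified character-level illustration with a hypothetical helper name, not the preprocessor's actual token-level logic:

```python
import re

def strip_html_and_shift(context: str, answer_start: int):
    """Remove HTML tags from a context string and shift a character-level
    answer offset so it still points at the same answer text (sketch;
    assumes the answer does not start inside a tag)."""
    removed_before = 0
    out = []
    pos = 0
    for m in re.finditer(r"<[^>]+>", context):
        out.append(context[pos:m.start()])
        if m.start() < answer_start:
            # Tags removed before the answer shift it left.
            removed_before += m.end() - m.start()
        pos = m.end()
    out.append(context[pos:])
    return "".join(out), answer_start - removed_before

text, start = strip_html_and_shift("<P>The quick brown fox</P>", 7)
```

With keep_html=True no such shifting is needed, since offsets are computed against the context as-is.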

get_annotations(annotations, paragraphs)#

Process NQ annotations into preprocessor format.

Parameters
  • annotations – annotations of the NQ example.

  • paragraphs – passage candidates of the NQ example.

Returns

Annotations in preprocessor format.

label_features_for_subsampling(tokenized_examples: transformers.tokenization_utils_base.BatchEncoding, examples: datasets.arrow_dataset.Batch) transformers.tokenization_utils_base.BatchEncoding#

Annotate each training feature with a ‘subsample_type’ of type SubsampleType for subsampling.

Parameters
  • tokenized_examples – featurized examples to annotate.

  • examples – original examples corresponding to the tokenized_examples features.

Returns

tokenized_examples annotated with ‘subsample_type’ for subsampling.

process_eval(examples: datasets.arrow_dataset.Dataset) Tuple[datasets.arrow_dataset.Dataset, datasets.arrow_dataset.Dataset]#

Process eval examples into features.

Parameters

examples – examples to process into features.

Returns

tuple (examples, features) comprising examples adapted into standardized format and processed input features for model.

process_train(examples: datasets.arrow_dataset.Dataset) Tuple[datasets.arrow_dataset.Dataset, datasets.arrow_dataset.Dataset]#

Process training examples into features.

Parameters

examples – examples to process into features.

Returns

tuple (examples, features) comprising examples adapted into standardized format and processed input features for model.
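The adapt-then-featurize pattern behind this return value can be sketched with stand-in helpers (the function and its arguments are hypothetical, not the real method signature):

```python
def process_train_sketch(examples, adapt, featurize):
    # Adapt raw examples into the standardized schema, then turn the
    # adapted examples into model input features. Both are returned so
    # that downstream code can map each feature back to its source
    # example (e.g. for scoring predictions per example).
    adapted = [adapt(ex) for ex in examples]
    features = [featurize(ex) for ex in adapted]
    return adapted, features

adapted, features = process_train_sketch([1, 2], lambda x: x * 10, lambda x: x + 1)
```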

subsample_features(dataset: datasets.arrow_dataset.Dataset) datasets.arrow_dataset.Dataset#

Subsample training features according to ‘subsample_type’:

  • All positive features are kept.

  • All negative features from an example that has an answer are kept with probability self._negative_sampling_prob_when_has_answer.

  • All negative features from an example that has no answer are kept with probability self._negative_sampling_prob_when_no_answer.

Parameters

dataset – features to subsample.

Returns

subsampled features.
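The three rules above can be sketched as a filter; this simplified stand-in operates on plain Python dicts rather than a datasets.Dataset, and the field values are assumptions:

```python
import random

def subsample(features, prob_has_answer, prob_no_answer, seed=0):
    """Keep every positive feature; keep each negative feature with the
    probability matching its subsample type (illustrative sketch)."""
    rng = random.Random(seed)
    kept = []
    for f in features:
        if f["subsample_type"] == "positive":
            kept.append(f)                                  # always kept
        elif f["subsample_type"] == "negative_has_answer":
            if rng.random() < prob_has_answer:
                kept.append(f)
        else:  # negative_no_answer
            if rng.random() < prob_no_answer:
                kept.append(f)
    return kept

feats = [{"subsample_type": "positive"},
         {"subsample_type": "negative_has_answer"},
         {"subsample_type": "negative_no_answer"}]
kept = subsample(feats, prob_has_answer=1.0, prob_no_answer=0.0)
```

Setting the two probabilities independently lets training down-weight the flood of negative features without discarding the answer-bearing ones.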

validate_schema(dataset: datasets.arrow_dataset.Dataset, is_train: bool, pre_adaptation: bool = True) None#

Validate that the data schema is correct for this preprocessor.

Parameters
  • dataset – data to validate the schema of.

  • is_train – whether the data is for training.

  • pre_adaptation – whether adapt_dataset has been called. This allows optional fields (e.g. example_id) to be imputed during adaptation.

Returns

None

Raises

ValueError – The data is not in the correct schema.
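A minimal sketch of what such schema validation looks like; the column names used here are illustrative assumptions, not the preprocessor's actual required fields:

```python
def validate_schema_sketch(column_names, is_train, pre_adaptation=True):
    """Raise ValueError if required columns are missing (sketch)."""
    required = {"question", "context"}        # assumed base fields
    if is_train:
        required |= {"target"}                # assumed training-only field
    if not pre_adaptation:
        # example_id may be imputed during adaptation, so it is only
        # required once adapt_dataset has run.
        required |= {"example_id"}
    missing = required - set(column_names)
    if missing:
        raise ValueError(f"Dataset is missing required columns: {sorted(missing)}")
```

Validating before and after adaptation with different required sets is what the pre_adaptation flag controls.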