primeqa.mrc.processors.preprocessors.natural_questions.NaturalQuestionsPreProcessor#
- class primeqa.mrc.processors.preprocessors.natural_questions.NaturalQuestionsPreProcessor(*args, **kwargs)#
Bases:
primeqa.mrc.processors.preprocessors.base.BasePreProcessor
Preprocessor for NQ data. Note that this preprocessor only supports single_context_multiple_passages=True and will override the value accordingly.
Methods
- adapt_dataset: Process dataset examples to rename fields, create context and set answer offset.
- get_annotations: Process NQ annotations into preprocessor format.
- label_features_for_subsampling: Annotate each training feature with a 'subsample_type' of type SubsampleType for subsampling.
- process_eval: Process eval examples into features.
- process_train: Process training examples into features.
- subsample_features: Subsample training features according to 'subsample_type'.
- validate_schema: Validate the data schema is correct for this preprocessor.
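The method entries below share the following setup sketch for context. It is a hedged illustration, not the library's documented usage: the constructor keyword names are assumptions inferred from BasePreProcessor and the attributes referenced on this page, and the Hugging Face natural_questions dataset (a very large download) stands in for raw NQ data. The names preprocessor and raw_train are reused in the later sketches.

```python
# Hedged setup sketch: constructor keyword names are assumptions inferred
# from BasePreProcessor and the attributes mentioned on this page; check
# the base class for the exact signature.
from datasets import load_dataset
from transformers import AutoTokenizer

from primeqa.mrc.processors.preprocessors.natural_questions import (
    NaturalQuestionsPreProcessor,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

preprocessor = NaturalQuestionsPreProcessor(
    tokenizer=tokenizer,   # assumed kwarg
    stride=128,            # assumed kwarg: overlap between feature windows
    max_seq_len=512,       # assumed kwarg
    negative_sampling_prob_when_has_answer=0.01,  # assumed; see subsample_features
    negative_sampling_prob_when_no_answer=0.04,   # assumed; see subsample_features
)

# Raw NQ examples from the Hugging Face hub (note: a very large download).
raw_train = load_dataset("natural_questions", split="train[:100]")
```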
- adapt_dataset(dataset: datasets.arrow_dataset.Dataset, is_train: bool, keep_html: bool = True) datasets.arrow_dataset.Dataset #
Process dataset examples to rename fields, create context and set answer offset.
- Parameters
dataset – Dataset to be processed.
is_train – True for training, otherwise False.
keep_html – True to keep HTML tokens in the context, otherwise False.
- Returns
Processed dataset.
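A short sketch of calling adapt_dataset directly, continuing the setup above. Normally process_train and process_eval invoke this step for you; the call here is only illustrative.

```python
# Adapt a small slice of raw NQ into the standardized schema.
adapted_train = preprocessor.adapt_dataset(raw_train, is_train=True, keep_html=True)

# Inspect the renamed fields, flattened context and answer offsets.
print(adapted_train.column_names)
```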
- get_annotations(annotations, paragraphs)#
Process NQ annotations into preprocessor format.
- Parameters
annotations – Annotations of the NQ example.
paragraphs – Passage candidates of the NQ example.
- Returns
Annotations in preprocessor format.
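get_annotations is a lower-level helper used during adaptation. The sketch below is a hypothetical call on a single raw example; the field names annotations and long_answer_candidates are assumptions based on the Hugging Face natural_questions schema and may differ in your raw data.

```python
# Hypothetical: convert one raw example's annotations into the
# preprocessor's internal format. Field names are assumptions.
example = raw_train[0]
annotations = example["annotations"]                     # assumed field name
passage_candidates = example["long_answer_candidates"]   # assumed field name

target = preprocessor.get_annotations(annotations, passage_candidates)
```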
- label_features_for_subsampling(tokenized_examples: transformers.tokenization_utils_base.BatchEncoding, examples: datasets.arrow_dataset.Batch) transformers.tokenization_utils_base.BatchEncoding #
Annotate each training feature with a 'subsample_type' of type SubsampleType for subsampling.
- Parameters
tokenized_examples – featurized examples to annotate.
examples – original examples corresponding to the tokenized_examples features.
- Returns
tokenized_examples annotated with 'subsample_type' for subsampling.
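This hook is invoked while building training features rather than called directly. As a hedged illustration of its shape, a hypothetical subclass (not part of the library) could extend it and inspect the labels the base implementation assigns:

```python
from datasets.arrow_dataset import Batch
from transformers.tokenization_utils_base import BatchEncoding

from primeqa.mrc.processors.preprocessors.natural_questions import (
    NaturalQuestionsPreProcessor,
)


class LoggingNQPreProcessor(NaturalQuestionsPreProcessor):
    """Hypothetical subclass: peek at subsample labels before subsampling."""

    def label_features_for_subsampling(
        self, tokenized_examples: BatchEncoding, examples: Batch
    ) -> BatchEncoding:
        tokenized_examples = super().label_features_for_subsampling(
            tokenized_examples, examples
        )
        # 'subsample_type' is added by the base implementation (per the docs above).
        print(tokenized_examples["subsample_type"][:5])
        return tokenized_examples
```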
- process_eval(examples: datasets.arrow_dataset.Dataset) Tuple[datasets.arrow_dataset.Dataset, datasets.arrow_dataset.Dataset] #
Process eval examples into features.
- Parameters
examples – examples to process into features.
- Returns
tuple (examples, features) comprising the examples adapted into the standardized format and the processed input features for the model.
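A sketch of the eval path, continuing the setup above. The adapted examples are returned alongside the features because answer post-processing typically needs both.

```python
from datasets import load_dataset

# Continuing the setup sketch: featurize a slice of the validation split.
raw_validation = load_dataset("natural_questions", split="validation[:100]")
eval_examples, eval_features = preprocessor.process_eval(raw_validation)
```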
- process_train(examples: datasets.arrow_dataset.Dataset) Tuple[datasets.arrow_dataset.Dataset, datasets.arrow_dataset.Dataset] #
Process training examples into features.
- Parameters
examples – examples to process into features.
- Returns
tuple (examples, features) comprising the examples adapted into the standardized format and the processed input features for the model.
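A sketch of the training path, continuing the setup above. One NQ example usually expands into several overlapping features, and negative features are typically thinned out via the subsampling described below.

```python
# Continuing the setup sketch: featurize the training slice.
train_examples, train_features = preprocessor.process_train(raw_train)

print(len(train_examples), "examples ->", len(train_features), "features")
print(train_features.column_names)  # inspect the model input columns
```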
- subsample_features(dataset: datasets.arrow_dataset.Dataset) datasets.arrow_dataset.Dataset #
Subsample training features according to 'subsample_type':
- All positive features are kept.
- All negative features from an example that has an answer are kept with probability self._negative_sampling_prob_when_has_answer.
- All negative features from an example that has no answer are kept with probability self._negative_sampling_prob_when_no_answer.
- Parameters
dataset – features to subsample.
- Returns
subsampled features.
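To make the keep/drop rule above concrete, here is an illustrative sketch only, not the library's implementation: the actual method filters a datasets.Dataset by its 'subsample_type' column, and the string labels below merely stand in for the SubsampleType enum values.

```python
import random


def keep_feature(subsample_type: str,
                 p_has_answer: float = 0.01,
                 p_no_answer: float = 0.04) -> bool:
    """Illustration of the subsampling rule; labels and probabilities assumed."""
    if subsample_type == "positive":
        return True  # positive features are always kept
    if subsample_type == "negative_has_answer":
        return random.random() < p_has_answer  # negatives from answerable examples
    return random.random() < p_no_answer       # negatives from unanswerable examples
```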
- validate_schema(dataset: datasets.arrow_dataset.Dataset, is_train: bool, pre_adaptation: bool = True) None #
Validate the data schema is correct for this preprocessor.
- Parameters
dataset – data to validate the schema of.
is_train – whether the data is for training.
pre_adaptation – whether validation occurs before adapt_dataset has been called. This allows optional fields (e.g. example_id) to be imputed during adaptation.
- Returns
None
- Raises
ValueError – The data is not in the correct schema.
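A hedged sketch of schema validation, continuing the earlier setup. The second call assumes, per the parameter description, that pre_adaptation=False is the right setting once adapt_dataset has been applied.

```python
# Check the raw NQ schema before adaptation; raises ValueError on a
# mismatch and returns None otherwise.
preprocessor.validate_schema(raw_train, is_train=True, pre_adaptation=True)

# After adaptation, the imputed/renamed fields should also validate
# (assumption based on the pre_adaptation parameter description).
adapted_train = preprocessor.adapt_dataset(raw_train, is_train=True)
preprocessor.validate_schema(adapted_train, is_train=True, pre_adaptation=False)
```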