primeqa.mrc.processors.preprocessors.tydiboolqa_bpes.TyDiBoolQAPreprocessor
- class primeqa.mrc.processors.preprocessors.tydiboolqa_bpes.TyDiBoolQAPreprocessor(*args, **kwargs)
Bases: primeqa.mrc.processors.preprocessors.tydiqa.TyDiQAPreprocessor
Methods
adapt_dataset – Convert dataset into standardized format accepted by the preprocessor.
label_features_for_subsampling – Annotate each training feature with a 'subsample_type' of type SubsampleType for subsampling.
process_eval – Process eval examples into features.
process_train – Sample comparison of features and updated_features on TyDi train: start and end positions are introduced on the two items with target_type==3 (YES) and not on other items.
subsample_features – Subsample training features according to 'subsample_type'.
validate_schema – Validate the data schema is correct for this preprocessor.
- adapt_dataset(dataset: datasets.arrow_dataset.Dataset, is_train: bool) → datasets.arrow_dataset.Dataset
Convert dataset into standardized format accepted by the preprocessor. This method will likely need to be overridden when subclassing.
- Parameters
dataset – data to adapt.
is_train – whether the dataset is for training.
- Returns
Adapted dataset.
- label_features_for_subsampling(tokenized_examples: transformers.tokenization_utils_base.BatchEncoding, examples: datasets.arrow_dataset.Batch) → transformers.tokenization_utils_base.BatchEncoding
Annotate each training feature with a 'subsample_type' of type SubsampleType for subsampling.
- Parameters
tokenized_examples – featurized examples to annotate.
examples – original examples corresponding to the tokenized_examples features.
- Returns
tokenized_examples annotated with 'subsample_type' for subsampling.
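The labeling step can be pictured with a small pure-Python sketch. It is an illustration only: the SubsampleType enum below is a hypothetical stand-in for the library's type, its member names are taken from the three categories listed under subsample_features, and the rule used to pick a label (a feature is positive when it contains an answer span) is an assumption rather than the library's confirmed logic.

```python
from enum import Enum


class SubsampleType(Enum):
    """Hypothetical stand-in for the library's SubsampleType."""
    POSITIVE = "positive"
    NEGATIVE_HAS_ANSWER = "negative_has_answer"
    NEGATIVE_NO_ANSWER = "negative_no_answer"


def label_feature(feature_has_span: bool, example_has_answer: bool) -> SubsampleType:
    """Assumed rule: choose the label from two booleans describing the feature."""
    if feature_has_span:
        # The feature itself contains an answer span.
        return SubsampleType.POSITIVE
    if example_has_answer:
        # The example has an answer, but not inside this feature.
        return SubsampleType.NEGATIVE_HAS_ANSWER
    # Neither the feature nor its example has an answer.
    return SubsampleType.NEGATIVE_NO_ANSWER


print(label_feature(False, True))  # SubsampleType.NEGATIVE_HAS_ANSWER
```

In the real method the label would be attached to each entry of tokenized_examples under the 'subsample_type' key; how the two booleans are derived from the tokenized features is not shown here.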
- process_eval(examples: datasets.arrow_dataset.Dataset) → Tuple[datasets.arrow_dataset.Dataset, datasets.arrow_dataset.Dataset]
Process eval examples into features.
- Parameters
examples – examples to process into features.
- Returns
tuple (examples, features) comprising examples adapted into standardized format and processed input features for model.
- process_train(examples: datasets.arrow_dataset.Dataset) → Tuple[datasets.arrow_dataset.Dataset, datasets.arrow_dataset.Dataset]
Sample comparison of features and updated_features on TyDi train. Start and end positions are introduced on the two items with target_type==3 (YES) and not on other items:
In [28]: features[0:20]['start_positions']
Out[28]: [0, 0, 0, 0, 0, 0, 14, 0, 0, 291, 0, 246, 0, 189, 0, 101, 0, 0, 0, 0]
In [29]: updated_features[0:20]['start_positions']
Out[29]: [0, 0, 0, 0, 0, 0, 14, 0, 0, 291, 0, 246, 0, 189, 0, 101, 0, 191, 17, 0]
In [30]: features[0:20]['end_positions']
Out[30]: [0, 0, 0, 0, 0, 0, 26, 0, 0, 302, 0, 249, 0, 193, 0, 106, 0, 0, 0, 0]
In [31]: updated_features[0:20]['end_positions']
Out[31]: [0, 0, 0, 0, 0, 0, 26, 0, 0, 302, 0, 249, 0, 193, 0, 106, 0, 511, 278, 0]
In [32]: features[0:20]['target_type']
Out[32]: [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 1, 2, 1, 0, 1, 0, 3, 3, 0]
In [33]: updated_features[0:20]['target_type']
Out[33]: [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 1, 2, 1, 0, 1, 0, 3, 3, 0]
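The claim in the session above can be checked mechanically. The following is a minimal sketch that copies the twenty values from the session into plain Python dictionaries (the real objects are dataset features, and the helper name changed_indices is hypothetical) and lists the indices where each column differs.

```python
# Values copied from the session above (first 20 training features).
features = {
    "start_positions": [0, 0, 0, 0, 0, 0, 14, 0, 0, 291, 0, 246, 0, 189, 0, 101, 0, 0, 0, 0],
    "end_positions": [0, 0, 0, 0, 0, 0, 26, 0, 0, 302, 0, 249, 0, 193, 0, 106, 0, 0, 0, 0],
    "target_type": [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 1, 2, 1, 0, 1, 0, 3, 3, 0],
}
updated_features = {
    "start_positions": [0, 0, 0, 0, 0, 0, 14, 0, 0, 291, 0, 246, 0, 189, 0, 101, 0, 191, 17, 0],
    "end_positions": [0, 0, 0, 0, 0, 0, 26, 0, 0, 302, 0, 249, 0, 193, 0, 106, 0, 511, 278, 0],
    "target_type": [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 1, 2, 1, 0, 1, 0, 3, 3, 0],
}


def changed_indices(before, after, column):
    """Return the indices at which `column` differs between the two feature sets."""
    pairs = zip(before[column], after[column])
    return [i for i, (old, new) in enumerate(pairs) if old != new]


start_changed = changed_indices(features, updated_features, "start_positions")
end_changed = changed_indices(features, updated_features, "end_positions")
# Indices whose target_type is 3 (YES).
yes_indices = [i for i, t in enumerate(features["target_type"]) if t == 3]

print(start_changed)  # [17, 18]
print(end_changed)    # [17, 18]
print(yes_indices)    # [17, 18]
```

The changed indices coincide with the two target_type==3 items, and target_type itself is unchanged.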
- subsample_features(dataset: datasets.arrow_dataset.Dataset) → datasets.arrow_dataset.Dataset
Subsample training features according to 'subsample_type':
All positive features are kept.
All negative features from an example that has an answer are kept with probability self._negative_sampling_prob_when_has_answer.
All negative features from an example that has no answer are kept with probability self._negative_sampling_prob_when_no_answer.
- Parameters
dataset – features to subsample.
- Returns
subsampled features.
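The three rules above amount to keeping each feature with a probability that depends on its 'subsample_type'. Here is a minimal pure-Python sketch of that idea, not the library's implementation: the SubsampleType enum and its member names are hypothetical stand-ins, features are plain dictionaries rather than a Dataset, and the two probabilities are passed as arguments instead of being read from self.

```python
import random
from enum import Enum


class SubsampleType(Enum):
    """Hypothetical stand-in for the library's SubsampleType."""
    POSITIVE = "positive"
    NEGATIVE_HAS_ANSWER = "negative_has_answer"
    NEGATIVE_NO_ANSWER = "negative_no_answer"


def subsample(features, prob_has_answer, prob_no_answer, rng):
    """Keep each feature with the probability assigned to its subsample type."""
    keep_prob = {
        SubsampleType.POSITIVE: 1.0,  # positives are always kept
        SubsampleType.NEGATIVE_HAS_ANSWER: prob_has_answer,
        SubsampleType.NEGATIVE_NO_ANSWER: prob_no_answer,
    }
    # rng.random() lies in [0, 1), so probability 1.0 always keeps and 0.0 never does.
    return [f for f in features if rng.random() < keep_prob[f["subsample_type"]]]


features = [
    {"feature_id": 0, "subsample_type": SubsampleType.POSITIVE},
    {"feature_id": 1, "subsample_type": SubsampleType.NEGATIVE_HAS_ANSWER},
    {"feature_id": 2, "subsample_type": SubsampleType.NEGATIVE_NO_ANSWER},
]
kept = subsample(features, 0.0, 0.0, random.Random(0))
print([f["feature_id"] for f in kept])  # [0]
```

With both probabilities set to 0.0 only the positive feature survives; with both set to 1.0 every feature is kept. Intermediate values drop negatives at random, which is why a seeded generator is passed in for reproducibility.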
- validate_schema(dataset: datasets.arrow_dataset.Dataset, is_train: bool, pre_adaptation: bool = True) → None
Validate the data schema is correct for this preprocessor.
- Parameters
dataset – data whose schema is validated.
is_train – whether the data is for training.
pre_adaptation – whether adapt_dataset has been called. This allows for optional fields (e.g. example_id) to be imputed during adaptation.
- Returns
None
- Raises
ValueError – The data is not in the correct schema.
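A schema check of this kind can be sketched in a few lines of plain Python. Everything here is illustrative: the function validate_columns, the column names "question" and "context", and the reading of pre_adaptation (optional fields such as example_id may be absent before adaptation and are required afterwards) are assumptions, not the library's confirmed behavior.

```python
def validate_columns(column_names, required, optional, pre_adaptation=True):
    """Raise ValueError if a needed column is missing (hypothetical helper)."""
    needed = set(required)
    if not pre_adaptation:
        # Assumed: after adaptation, optional fields must have been imputed.
        needed |= set(optional)
    missing = sorted(needed - set(column_names))
    if missing:
        raise ValueError(f"Missing columns: {missing}")


columns = ["question", "context"]  # hypothetical column names

# Before adaptation the optional example_id may be absent.
validate_columns(columns, ["question", "context"], ["example_id"], pre_adaptation=True)

# After adaptation the same columns fail the check.
try:
    validate_columns(columns, ["question", "context"], ["example_id"], pre_adaptation=False)
except ValueError as err:
    print(err)  # Missing columns: ['example_id']
```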