primeqa.mrc.run_mrc_utils.get_raw_datasets

primeqa.mrc.run_mrc_utils.get_raw_datasets(fof, data_args, task_args, cache_dir, split='train')

Load multiple datasets, each of which is either a HuggingFace dataset or a local data file.

Parameters

fof –
The file that specifies multiple datasets.
The fof file needs to be in json, jsonl, or csv format.
If in csv format, the columns of each line, separated by spaces, are as follows:
dataset name or path of local data file;

dataset config name or data file format;

sampling rate within range 0.0~1.0, e.g. 0.5 means 50% of the examples are randomly selected and loaded;

preprocessor name.

If columns 2 to 4 are not given, default values are used: data_args.dataset_config_name for a dataset or data_args.data_file_format for a data file; "1.0" for the sampling rate; and task_args.preprocessor for the preprocessor name.
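The column defaults above can be illustrated with a minimal sketch. This is not the PrimeQA implementation: parse_fof_line is a hypothetical helper, the preprocessor names and argument values are placeholders, and the rule used to tell a local data file from a dataset name (checking whether the path exists) is an assumption.

```python
import os
from types import SimpleNamespace


def parse_fof_line(line, data_args, task_args):
    """Split one space-separated fof line and fill in the documented defaults."""
    columns = line.split()
    if not columns:
        raise ValueError("empty fof line")
    name = columns[0]
    # Assumption: an existing path is a local data file, anything else a dataset name.
    default_config = (
        data_args.data_file_format
        if os.path.isfile(name)
        else data_args.dataset_config_name
    )
    config = columns[1] if len(columns) > 1 else default_config
    sampling_rate = float(columns[2]) if len(columns) > 2 else 1.0
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError(f"sampling rate out of range: {sampling_rate}")
    preprocessor = columns[3] if len(columns) > 3 else task_args.preprocessor
    return {
        "dataset": name,
        "config": config,
        "sampling_rate": sampling_rate,
        "preprocessor": preprocessor,
    }


# Stand-ins for the real argument objects; only the attributes named in the text are set.
data_args = SimpleNamespace(dataset_config_name="default_config", data_file_format="json")
task_args = SimpleNamespace(preprocessor="DefaultPreprocessor")

full_entry = parse_fof_line("squad plain_text 0.5 SomePreprocessor", data_args, task_args)
short_entry = parse_fof_line("squad", data_args, task_args)
```

Here short_entry falls back to data_args.dataset_config_name, a sampling rate of 1.0, and task_args.preprocessor, because only the first column is present.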
If in jsonl format, each line is a dictionary consisting of

{"dataset": dataset_name_or_path_of_data_file,
"config": dataset_config_or_data_file_format, "sampling_rate": sampling_rate, "preprocessor": preprocessor_name}

The fields config, sampling_rate, and preprocessor are optional; the default values described for the csv format are used when they are omitted.
If in json format, a list of the dictionaries shown above is expected.
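As an illustration of the jsonl and json layouts, the sketch below writes the same two entries in both formats and reads them back with the standard library. The helper read_fof_entries, the file names, and the preprocessor name are hypothetical, and choosing the parser from the file extension is an assumption, not a statement about how PrimeQA detects the format.

```python
import json
import os
import tempfile

# Two fof entries; the second relies on defaults for all optional fields.
entries = [
    {
        "dataset": "squad",
        "config": "plain_text",
        "sampling_rate": 0.5,
        "preprocessor": "SomePreprocessor",
    },
    {"dataset": "tydiqa"},
]


def read_fof_entries(path):
    """Return the list of entry dictionaries stored in a .jsonl or .json fof file."""
    with open(path, encoding="utf-8") as handle:
        if path.endswith(".jsonl"):
            # jsonl: one dictionary per non-empty line.
            return [json.loads(line) for line in handle if line.strip()]
        # json: a single list of dictionaries.
        return json.load(handle)


with tempfile.TemporaryDirectory() as tmp_dir:
    jsonl_path = os.path.join(tmp_dir, "fof.jsonl")
    json_path = os.path.join(tmp_dir, "fof.json")

    with open(jsonl_path, "w", encoding="utf-8") as handle:
        for entry in entries:
            handle.write(json.dumps(entry) + "\n")

    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(entries, handle)

    from_jsonl = read_fof_entries(jsonl_path)
    from_json = read_fof_entries(json_path)
```

Both files decode to the same list of dictionaries, so the choice between jsonl and json is a matter of convenience.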
data_args – data arguments containing dataset_config_name and data_file_format.
task_args – task arguments containing preprocessor.
cache_dir – cache dir for downloading datasets.
split – split of the dataset to be loaded. Has no effect on local data files.

Returns

raw_datasets: list of datasets loaded and sampled; preprocessors: list of preprocessor names.

Return type

raw_datasets

Raises

ValueError – If the datasets or data files cannot be loaded.
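The sampling rate described under fof can be pictured with a small sketch on a plain Python list. This only illustrates what "50% of the examples are randomly selected" means; sample_examples, its seed parameter, and the truncation of the kept count are assumptions, and the library's actual sampling procedure may differ.

```python
import random


def sample_examples(examples, sampling_rate, seed=0):
    """Keep a random fraction of examples, preserving their original order."""
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be within 0.0 and 1.0")
    keep = int(len(examples) * sampling_rate)
    rng = random.Random(seed)
    # Draw distinct indices, then sort them so the kept examples stay in order.
    indices = sorted(rng.sample(range(len(examples)), keep))
    return [examples[i] for i in indices]


examples = [f"example-{i}" for i in range(10)]
half = sample_examples(examples, 0.5)
everything = sample_examples(examples, 1.0)
```

With a rate of 1.0 every example is kept, which matches the documented default.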