primeqa.mrc.run_mrc_utils.get_raw_datasets
- primeqa.mrc.run_mrc_utils.get_raw_datasets(fof, data_args, task_args, cache_dir, split='train')
Load multiple datasets, each of which is either a HuggingFace dataset or a local data file.
- Parameters
fof –
- The file that specifies multiple datasets.
The fof file needs to be in json, jsonl, or csv format.
- If in csv format, the columns of each line, separated by spaces, are as follows:
dataset name or path of local data file;
dataset config name or data file format;
sampling rate within range 0.0~1.0, e.g. 0.5 means 50% of the examples are randomly selected and loaded;
preprocessor name.
If columns 2 to 4 are not given, default values are used: data_args.dataset_config_name for a dataset or data_args.data_file_format for a data file; “1.0” for the sampling rate; task_args.preprocessor for the preprocessor name.
- If in jsonl format, each line is a dictionary consisting of
{'dataset': dataset_name_or_path_of_data_file, 'config': dataset_config_or_data_file_format, 'sampling_rate': sampling_rate, 'preprocessor': preprocessor_name}
The fields 'config', 'sampling_rate', and 'preprocessor' are optional; the same default values as in the csv format are used for any field that is omitted.
- If in json format, a list of dictionaries of the form shown above is expected.
data_args – data arguments containing dataset_config_name and data_file_format.
task_args – task arguments containing preprocessor.
cache_dir – cache dir for downloading datasets.
split – split of the dataset to be loaded. Has no effect on local data files.
- Returns
raw_datasets – list of datasets loaded and sampled.
preprocessors – list of preprocessor names.
- Raises
ValueError – Unable to load datasets or data files.
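- Example
The fof formats described under Parameters can be illustrated with a small standalone sketch. It is not the PrimeQA implementation: the helper names parse_csv_fof and parse_jsonl_fof, the dataset names, and the preprocessor names are hypothetical, and a single default_config argument stands in for the choice between data_args.dataset_config_name and data_args.data_file_format. It only shows how omitted columns or fields could fall back to their defaults.

```python
import json


def parse_csv_fof(text, default_config, default_preprocessor):
    """Parse space-separated fof lines, filling in omitted columns 2 to 4.

    Hypothetical helper for illustration only; not part of PrimeQA.
    """
    entries = []
    for line in text.splitlines():
        cols = line.split()
        if not cols:
            continue  # skip blank lines
        entries.append({
            "dataset": cols[0],
            "config": cols[1] if len(cols) > 1 else default_config,
            "sampling_rate": float(cols[2]) if len(cols) > 2 else 1.0,
            "preprocessor": cols[3] if len(cols) > 3 else default_preprocessor,
        })
    return entries


def parse_jsonl_fof(text, default_config, default_preprocessor):
    """Parse a jsonl fof (one dictionary per line) with the same defaults.

    Hypothetical helper for illustration only; not part of PrimeQA.
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        entries.append({
            "dataset": record["dataset"],
            "config": record.get("config", default_config),
            "sampling_rate": float(record.get("sampling_rate", 1.0)),
            "preprocessor": record.get("preprocessor", default_preprocessor),
        })
    return entries


# Placeholder names: the first line gives all four columns,
# the second gives only the dataset name and relies on defaults.
csv_fof = "my_dataset my_config 0.5 my_preprocessor\nother_dataset\n"
jsonl_fof = (
    '{"dataset": "my_dataset", "config": "my_config", "sampling_rate": 0.5}\n'
    '{"dataset": "data/train.jsonl"}\n'
)

csv_entries = parse_csv_fof(csv_fof, "default_config", "default_preprocessor")
jsonl_entries = parse_jsonl_fof(jsonl_fof, "default_config", "default_preprocessor")
```

In this sketch, an entry that gives only the dataset name ends up with the default config, a sampling rate of 1.0, and the default preprocessor, mirroring the fallback rules stated for the fof parameter.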