line_corpus
Functions
Shuffle a possibly endless iterator block by block.
Good shuffling over multiple files: block_shuffle(read_lines(files, shuffled_files=rand), rand=rand, block_size=100000)
:param iter: the iterator to yield shuffled items from
:param block_size: size of the in-memory block used for shuffling
:param rand: rand.shuffle is applied to each block (a list)
:return:

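The block-shuffling idea described above can be sketched as follows. This is a minimal illustration, not the module's implementation: the name `block_shuffle_sketch` is hypothetical, and it assumes `block_size` counts items rather than bytes.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def block_shuffle_sketch(items: Iterable[T], rand: random.Random,
                         block_size: int = 100000) -> Iterator[T]:
    """Yield items shuffled within consecutive blocks of at most block_size.

    Hypothetical sketch: block_size is assumed to be an item count.
    """
    block: List[T] = []
    for item in items:
        block.append(item)
        if len(block) >= block_size:
            # The block is full: shuffle it in place and emit it.
            rand.shuffle(block)
            yield from block
            block = []
    # Emit the final, possibly partial, block.
    if block:
        rand.shuffle(block)
        yield from block
```

Because only one block is held in memory at a time, the sketch also works on endless iterators. The trade-off is that an item can never move outside its own block, which is why the source pairs block shuffling with a shuffled file order.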
Expand a list of files and directories into a list of files.
:param input_files:
:param file_pattern: glob pattern for the recursive search, for example '.jsonl' for jsonl and jsonl.gz
:param completed_files: files to exclude from the final list
:return:

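One way such an expansion could work is sketched below. The name `expand_files_sketch` is hypothetical, and the sketch assumes a standard `glob` pattern such as `*.jsonl*`; how the real function interprets a pattern like '.jsonl' is not stated in the source.

```python
import glob
import os
import tempfile
from typing import Iterable, List, Optional, Union


def expand_files_sketch(input_files: Union[str, Iterable[str]],
                        file_pattern: str = "*",
                        completed_files: Optional[Iterable[str]] = None) -> List[str]:
    """Expand directories recursively and drop already completed files.

    Hypothetical sketch: file_pattern is treated as a plain glob pattern.
    """
    if isinstance(input_files, str):
        input_files = [input_files]
    completed = {os.path.normpath(f) for f in (completed_files or [])}
    expanded: List[str] = []
    for path in input_files:
        if os.path.isdir(path):
            # "**" with recursive=True also matches files directly in path.
            matches = glob.glob(os.path.join(path, "**", file_pattern), recursive=True)
            expanded.extend(sorted(m for m in matches if os.path.isfile(m)))
        else:
            expanded.append(path)
    return [f for f in expanded if os.path.normpath(f) not in completed]


# Usage on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for name in ("a.jsonl", os.path.join("sub", "b.jsonl.gz"), "notes.txt"):
        open(os.path.join(root, name), "w").close()
    done = [os.path.join(root, "a.jsonl")]
    found = [os.path.relpath(p, root)
             for p in expand_files_sketch(root, "*.jsonl*", completed_files=done)]
```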
Convert a numpy ndarray to a compact string representation.
:param nda: numpy array
:param dtype: numpy datatype to save the array as
:return: base64-encoded string of the numpy binary data

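The encoding direction might look like the sketch below. The name `ndarray_to_string_sketch` is hypothetical, and the sketch assumes that "numpy binary" means the raw bytes from `tobytes()`; the real function may use a different byte layout.

```python
import base64

import numpy as np


def ndarray_to_string_sketch(nda: np.ndarray, dtype=np.float32) -> str:
    """Encode an array as base64 text of its raw bytes in the given dtype.

    Hypothetical sketch: shape information is not stored.
    """
    # Cast to the requested dtype and make sure the memory is contiguous.
    data = np.ascontiguousarray(nda, dtype=dtype)
    return base64.b64encode(data.tobytes()).decode("ascii")


encoded = ndarray_to_string_sketch(np.array([1.0, 2.0]), dtype=np.float32)
```

Choosing a smaller `dtype` such as `np.float16` shortens the string at the cost of precision.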
Take a single input file or a list of input files and iterate over the lines in them.
:param input_files: directory name or list of file names
:param limit: maximum number of lines to read
:param report_every: log progress after this many lines
:return:

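A simplified version of this line iteration is sketched below. The name `read_lines_sketch` is hypothetical. The sketch assumes uncompressed UTF-8 text files, accepts file names only (no directories), and treats `limit=0` as "no limit"; none of these details are confirmed by the source.

```python
import logging
import os
import tempfile
from typing import Iterable, Iterator, Union

logger = logging.getLogger(__name__)


def read_lines_sketch(input_files: Union[str, Iterable[str]],
                      limit: int = 0,
                      report_every: int = 100000) -> Iterator[str]:
    """Yield lines from each file in order, stopping after limit lines.

    Hypothetical sketch: limit <= 0 is assumed to mean "read everything".
    """
    if isinstance(input_files, str):
        input_files = [input_files]
    count = 0
    for path in input_files:
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                yield line
                count += 1
                if report_every > 0 and count % report_every == 0:
                    logger.info("read %d lines", count)
                if 0 < limit <= count:
                    return


# Usage: two small files, stop after three lines in total.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "one.txt")
    second = os.path.join(tmp, "two.txt")
    with open(first, "w", encoding="utf-8") as f:
        f.write("a\nb\n")
    with open(second, "w", encoding="utf-8") as f:
        f.write("c\nd\n")
    lines = list(read_lines_sketch([first, second], limit=3))
```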
Open a text file for reading, inferring compression from the file extension.
:param input_file:
:return:

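Extension-based opening for reading can be sketched with the standard library. The name `open_for_read_sketch` is hypothetical, and the set of supported extensions (`.gz`, `.bz2`) is an assumption; the source does not list which ones the real function handles.

```python
import bz2
import gzip
import os
import tempfile
from typing import TextIO


def open_for_read_sketch(input_file: str) -> TextIO:
    """Return a text-mode handle, choosing the opener by file extension.

    Hypothetical sketch: only .gz and .bz2 are treated as compressed.
    """
    if input_file.endswith(".gz"):
        return gzip.open(input_file, "rt", encoding="utf-8")
    if input_file.endswith(".bz2"):
        return bz2.open(input_file, "rt", encoding="utf-8")
    return open(input_file, "r", encoding="utf-8")


# Usage: write a gzip file, then read it back through the helper.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write("first\nsecond\n")
    with open_for_read_sketch(path) as f:
        read_back = f.read().splitlines()
```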
Convert a compact string representation of a numpy ndarray back to a numpy vector.
:param s: base64-encoded string of the numpy binary data
:param dtype: numpy datatype of the saved array
:return: 1-D array (shape is not preserved)

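The decoding direction, and the reason the shape is lost, can be illustrated as follows. The name `string_to_ndarray_sketch` is hypothetical, and the sketch assumes the string holds base64 of the array's raw bytes, which carry no shape information.

```python
import base64

import numpy as np


def string_to_ndarray_sketch(s: str, dtype=np.float32) -> np.ndarray:
    """Decode base64 text of raw array bytes into a 1-D array.

    Hypothetical sketch: the result is a read-only view over the decoded bytes.
    """
    return np.frombuffer(base64.b64decode(s), dtype=dtype)


# A 2x2 matrix comes back flattened, because only raw bytes were stored.
original = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
encoded = base64.b64encode(original.tobytes()).decode("ascii")
restored = string_to_ndarray_sketch(encoded, dtype=np.float32)
```

The `dtype` passed when decoding must match the one used when encoding; otherwise the bytes are reinterpreted as different numbers. A caller that needs the original shape has to store it separately and call `reshape`.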
Open a text file for writing, inferring compression from the file extension.
:param output_file:
:param mkdir:
:return:
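A writing counterpart might look like the sketch below. The name `open_for_write_sketch` is hypothetical. The sketch assumes that `mkdir` means "create missing parent directories" and that only `.gz` is treated as compressed; the source documents neither detail.

```python
import gzip
import os
import tempfile
from typing import TextIO


def open_for_write_sketch(output_file: str, mkdir: bool = True) -> TextIO:
    """Return a text-mode handle for writing, chosen by file extension.

    Hypothetical sketch: mkdir is assumed to create missing parent directories.
    """
    if mkdir:
        parent = os.path.dirname(output_file)
        if parent:
            os.makedirs(parent, exist_ok=True)
    if output_file.endswith(".gz"):
        return gzip.open(output_file, "wt", encoding="utf-8")
    return open(output_file, "w", encoding="utf-8")


# Usage: the nested directories do not exist before the call.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "new", "dir", "out.txt.gz")
    with open_for_write_sketch(path, mkdir=True) as f:
        f.write("hello\n")
    with gzip.open(path, "rt", encoding="utf-8") as f:
        written = f.read()
```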
Classes