line_corpus

Functions

block_shuffle

Shuffle a possibly endless iterator by blocks.

Good shuffling over multiple files: block_shuffle(read_lines(files, shuffled_files=rand), rand=rand, block_size=100000)

:param iter: the iterator we will yield shuffled items from
:param block_size: size of memory to use for block shuffling
:param rand: rand.shuffle will be used on the list block
:return:
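The idea of shuffling by blocks can be illustrated with a minimal sketch. This is not the library's implementation: the name block_shuffle_sketch is hypothetical, and it assumes that block_size counts items rather than bytes. It reads up to block_size items into a list, shuffles that list, yields its items, and repeats until the iterator is exhausted, so memory use stays bounded even for an endless input.

```python
import itertools
import random


def block_shuffle_sketch(items, rand=None, block_size=100000):
    """Yield items in block-shuffled order (illustrative, not the library code)."""
    rand = rand or random.Random()
    it = iter(items)
    while True:
        # Pull at most block_size items; islice stops early if the input ends.
        block = list(itertools.islice(it, block_size))
        if not block:
            return
        rand.shuffle(block)  # items are only shuffled within their own block
        yield from block


shuffled = list(block_shuffle_sketch(range(10), rand=random.Random(0), block_size=4))
```

Because items never leave their block, the result is only locally random. Reading the files in shuffled order as well, as in the usage line above, is what mixes records across files.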

expand_files

Expand the list of files and directories.

:param input_files:
:param file_pattern: glob pattern for the recursive search, for example '.jsonl' for jsonl and jsonl.gz
:param completed_files: these will not be returned in the final list
:return:
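A rough sketch of this kind of expansion, under stated assumptions: the name expand_files_sketch is hypothetical, directories are walked recursively with os.walk, file names are matched with fnmatch, and the pattern `*.jsonl*` is chosen for this sketch only. The library's own matching rules and return order may differ.

```python
import fnmatch
import os
import tempfile


def expand_files_sketch(input_files, file_pattern="*", completed_files=()):
    """Expand files and directories into a flat, sorted list of file paths."""
    if isinstance(input_files, str):
        input_files = [input_files]  # accept a single path as well as a list
    skip = set(completed_files)
    found = []
    for path in input_files:
        if os.path.isdir(path):
            # Walk the tree and keep files whose name matches the pattern.
            for root, _dirs, names in os.walk(path):
                for name in names:
                    if fnmatch.fnmatch(name, file_pattern):
                        found.append(os.path.join(root, name))
        else:
            found.append(path)  # explicitly listed files are taken as given
    return sorted(f for f in found if f not in skip)


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for parts in (("a.jsonl",), ("sub", "b.jsonl.gz"), ("sub", "notes.txt")):
        open(os.path.join(tmp, *parts), "w").close()
    done = [os.path.join(tmp, "a.jsonl")]  # pretend this file was already processed
    remaining = expand_files_sketch(tmp, "*.jsonl*", completed_files=done)
    remaining_names = [os.path.basename(p) for p in remaining]
```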

gunzip_str

gzip_str

jsonl_files

jsonl_lines

jsonl_records

np2str

Convert a numpy ndarray to a compact string representation.

:param nda: numpy array
:param dtype: numpy datatype to save the array as
:return: base64 encoded string of numpy binary

read_lines

Takes a list of input files (or a single one) and iterates over the lines in them.

:param input_files: Directory name or list of file names
:param limit: maximum number of lines to read
:param report_every: log info after this many lines
:return:
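The limit and report_every parameters can be pictured with a small sketch. It is illustrative only: read_lines_sketch is a hypothetical name, it handles plain uncompressed text files, and it ignores directory inputs and the shuffled_files option, both of which the real function may handle differently.

```python
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


def read_lines_sketch(input_files, limit=None, report_every=100000):
    """Yield lines from each file in turn, stopping after `limit` lines."""
    if isinstance(input_files, str):
        input_files = [input_files]  # accept a single file name
    count = 0
    for path in input_files:
        with open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if limit is not None and count >= limit:
                    return  # leaving the generator also closes the file
                yield line
                count += 1
                if count % report_every == 0:
                    logger.info("read %d lines", count)


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(2):
        p = os.path.join(tmp, f"part{i}.txt")
        with open(p, "w", encoding="utf-8") as f:
            f.write(f"{i}-a\n{i}-b\n")
        paths.append(p)
    first_three = list(read_lines_sketch(paths, limit=3))
```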

read_open

Open a text file for reading, assuming compression from the extension.

:param input_file:
:return:

read_records

str2np

Convert a compact string representation of a numpy ndarray to a numpy vector.

:param s: base64 encoded string of numpy binary
:param dtype: numpy datatype of the saved array
:return: 1-D array (shape is not preserved)
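A round trip through np2str and str2np might look roughly like the sketch below. The helper names np2str_sketch and str2np_sketch are hypothetical, and the float32 default is an assumption made for this example; the sketch only shows why a base64 string of raw array bytes decodes to a flat vector.

```python
import base64

import numpy as np


def np2str_sketch(nda, dtype=np.float32):
    """Encode an array's raw bytes as base64 text (shape is not stored)."""
    raw = np.asarray(nda).astype(dtype).tobytes()
    return base64.b64encode(raw).decode("ascii")


def str2np_sketch(s, dtype=np.float32):
    """Decode base64 text back into a 1-D array of the given dtype."""
    return np.frombuffer(base64.b64decode(s), dtype=dtype)


matrix = np.arange(6, dtype=np.float32).reshape(2, 3)
encoded = np2str_sketch(matrix)
decoded = str2np_sketch(encoded)  # flat vector; the 2x3 shape is gone
```

Since only the bytes are stored, the caller has to pass the same dtype on both sides and reshape the result if the original shape matters.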

write_open

Open a text file for writing, assuming compression from the extension.

:param output_file:
:param mkdir:
:return:
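Choosing a codec from the file extension can be sketched as follows. This is a sketch under assumptions, not the library's code: open_by_extension is a hypothetical helper, the set of recognised extensions (.gz and .bz2) is a guess, and mkdir is assumed to mean "create the parent directory if it is missing".

```python
import bz2
import gzip
import os
import tempfile


def open_by_extension(path, mode="rt", mkdir=False):
    """Open a text file, picking gzip/bz2/plain from the file extension."""
    if mkdir and os.path.dirname(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
    if path.endswith(".gz"):
        return gzip.open(path, mode, encoding="utf-8")
    if path.endswith(".bz2"):
        return bz2.open(path, mode, encoding="utf-8")
    return open(path, mode, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "out", "data.txt.gz")
    with open_by_extension(target, "wt", mkdir=True) as f:
        f.write("hello\nworld\n")
    with open_by_extension(target, "rt") as f:
        lines_back = f.read().splitlines()
    with open(target, "rb") as f:
        magic = f.read(2)  # gzip files start with the bytes 1f 8b
```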

Classes

ShuffledWriter