karld Package

karld.__init__.is_py3()[source]

_meta Module

conversion_operators Module

karld.conversion_operators.apply_conversion_map(conversion_map, entity)[source]

Returns a tuple of the converted values.

karld.conversion_operators.apply_conversion_map_map(conversion_map, entity)[source]

Returns an ordered dict of the keys and converted values.
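
The exact shape of conversion_map is not spelled out on this page; a minimal sketch of the plausible behavior, assuming it is an ordered collection of (key, converter) pairs where each converter is a callable applied to the entity:

from collections import OrderedDict

def apply_conversion_map_sketch(conversion_map, entity):
    # Apply each converter to the entity, keeping only the values.
    return tuple(convert(entity) for _, convert in conversion_map)

def apply_conversion_map_map_sketch(conversion_map, entity):
    # Keep the keys, pairing each with its converted value.
    return OrderedDict(
        (key, convert(entity)) for key, convert in conversion_map)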

karld.conversion_operators.get_number_as_int(number)[source]

Returns the first number found in a string, as an int.
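
A plausible sketch, assuming “the first number” means the first run of digits in the string (the regex approach is an assumption, not taken from the source):

import re

def get_number_as_int_sketch(number):
    # Return the first run of digits in the string as an int,
    # or None if there are no digits.
    match = re.search(r'\d+', str(number))
    if match:
        return int(match.group())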

karld.conversion_operators.join_stripped_gotten_value(sep, getters, data)[source]

Join, with the separator, the values gotten from the data with the collection of getters, each coerced to str and stripped of whitespace padding.

Parameters:
  • sep (str) – Separator of values.
  • getters – collection of callables that each take the data and return a value.
  • data – argument for the getters
karld.conversion_operators.join_stripped_values(sep, collection_getter, data)[source]

Join, with the separator, the values gotten from the data with the collection_getter, each coerced to str and stripped of whitespace padding.

Parameters:
  • sep (str) – Separator of values.
  • collection_getter – callable that takes the data and returns a collection.
  • data – argument for the collection_getter
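
A hedged usage sketch of both joiners; the record, getters, and expected output are illustrative assumptions based only on the docstrings above:

from operator import itemgetter

from karld.conversion_operators import (
    join_stripped_gotten_value,
    join_stripped_values,
)

data = {'city': u' Boston ', 'state': u'MA '}

# Per the docstrings, both calls should strip the padding and
# produce u'Boston, MA'.
print join_stripped_gotten_value(
    u', ', (itemgetter('city'), itemgetter('state')), data)
print join_stripped_values(u', ', lambda d: (d['city'], d['state']), data)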

iter_utils Module

karld.iter_utils.i_batch(max_size, iterable)[source]

Generator that iteratively batches items to a max size and consumes the items iterable as each batch is yielded.

Parameters:
  • max_size (int) – Max size of each batch.
  • iterable (iter) – An iterable
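
A minimal sketch of the batching behavior described above, assuming each batch is yielded as a tuple (an implementation detail this page does not confirm):

from itertools import islice

def i_batch_sketch(max_size, iterable):
    # Consume up to max_size items at a time from the source,
    # yielding each full or final partial batch until exhausted.
    iterator = iter(iterable)
    while True:
        batch = tuple(islice(iterator, max_size))
        if not batch:
            break
        yield batch
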
karld.iter_utils.yield_getter_of(getter_maker, iterator)[source]

Lazily map the getter made by getter_maker over the iterator.

Parameters:
  • getter_maker – function that returns a getter function.
  • iterator – An iterator.
karld.iter_utils.yield_nth_of(nth, iterator)[source]

For an iterator that returns sequences, yield the nth value of each.

Parameters:
  • nth (int) – Index of the desired column of each sequence.
  • iterator – iterator of sequences.
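
A sketch of how these two plausibly relate: yield_nth_of is the special case of yield_getter_of with an itemgetter (this composition is an assumption based on the docstrings, not confirmed by this page):

from operator import itemgetter

def yield_getter_of_sketch(getter_maker, iterator):
    # Build the getter once, then lazily apply it to each item.
    getter = getter_maker()
    return (getter(item) for item in iterator)

def yield_nth_of_sketch(nth, iterator):
    # Lazily pull the nth column out of each sequence.
    return yield_getter_of_sketch(lambda: itemgetter(nth), iterator)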

loadump Module

karld.loadump.dump_dicts_to_json_file(file_name, dicts, buffering=10485760)[source]

Writes each dictionary in the dicts iterable to a line of the file as json.

NOTE: Deprecated. Replaced by write_as_json, to match the signature of write_as_csv.

Parameters:buffering (int) – number of bytes to buffer files
karld.loadump.ensure_dir(directory)[source]

If directory doesn’t exist, make it.

Parameters:directory (str) – path to directory
karld.loadump.ensure_file_path_dir(file_path)[source]

Ensure the parent directory of the file path.

Parameters:file_path (str) – Path to file.
karld.loadump.file_path_and_name(path, base_name)[source]

Join the path and base_name, and yield the joined file path along with the base_name.

Parameters:
  • path (str) – Directory path
  • base_name (str) – File name
Returns:

tuple of file path and file name.

karld.loadump.i_get_csv_data(file_name, *args, **kwargs)[source]

A generator for reading a csv file.

karld.loadump.i_get_json_data(file_name, *args, **kwargs)[source]

A generator for reading a file with json documents delimited by newlines.

karld.loadump.i_read_buffered_file(file_name, buffering=10485760, binary=True, py3_csv_read=False, encoding='utf-8')[source]

Generator of lines of a file name, with buffering for speed.
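
A hedged usage sketch of the readers above; the file paths are hypothetical:

from karld.loadump import i_get_csv_data, i_get_json_data

# Stream rows from a csv file without loading it all into memory.
for row in i_get_csv_data('data/input.csv'):
    print row

# Stream one json document per line of a newline-delimited json file.
for document in i_get_json_data('data/input.json'):
    print document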

karld.loadump.i_walk_dir_for_filepaths_names(root_dir)[source]

Walks a directory yielding the paths and names of files.

Parameters:root_dir (str) – path to a directory.
karld.loadump.i_walk_dir_for_paths_names(root_dir)[source]

Walks a directory, yielding the directory path and name of each file.

Parameters:root_dir (str) – path to a directory.
karld.loadump.identity(*args)[source]
karld.loadump.is_file_csv(file_path_name)[source]

Is the file a csv file? Identify by extension.

Parameters:file_path_name (str) – path to the file.
karld.loadump.is_file_json(file_path_name)[source]

Is the file a json file? Identify by extension.

Parameters:file_path_name (str) – path to the file.
karld.loadump.raw_line_reader(file_object)[source]
karld.loadump.split_file(file_path, out_dir=None, max_lines=200000, buffering=10485760, line_reader=<function raw_line_reader>, split_file_writer=<function split_file_output>, read_binary=True)[source]

Opens then shards the file.

Parameters:
  • file_path (str) – Path to the large input file.
  • max_lines (int) – Max number of lines in each shard.
  • out_dir (str) – Path of directory to put the shards.
  • buffering (int) – number of bytes to buffer files
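
A usage sketch built only from the documented signature; the paths and shard size are hypothetical:

from karld.loadump import split_file

# Shard a large file into pieces of at most 100000 lines each,
# written into the given output directory.
split_file('data/big_file.txt', out_dir='data/shards', max_lines=100000)
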
karld.loadump.split_file_output(name, data, out_dir, max_lines=1100, buffering=10485760)[source]

Split an iterable of lines into groups and write each group to a shard.

Parameters:
  • name (str) – Each shard will use this in its name.
  • data (iter) – Iterable of data to write.
  • out_dir (str) – Path to directory to write the shards.
  • max_lines (int) – Max number of lines per shard.
  • buffering (int) – number of bytes to buffer files
karld.loadump.split_file_output_csv(filename, data, out_dir=None, max_lines=1100, buffering=10485760, write_as_csv=<function write_as_csv>)[source]

Split an iterable of csv serializable rows of data into groups and write each to a csv shard.

Parameters:buffering (int) – number of bytes to buffer files
karld.loadump.split_file_output_json(filename, dict_list, out_dir=None, max_lines=1100, buffering=10485760)[source]

Split an iterable of JSON serializable rows of data into groups and write each to a shard.

Parameters:buffering (int) – number of bytes to buffer files
karld.loadump.write_as_csv(items, file_name, append=False, line_buffer_size=None, buffering=10485760, get_csv_row_writer=<function get_csv_row_writer>)[source]

Writes out items to a csv file in groups.

Parameters:
  • items – An iterable collection of collections.
  • file_name – path to the output file.
  • append – whether to append or overwrite the file.
  • line_buffer_size – number of lines to write at a time.
  • buffering (int) – number of bytes to buffer files
  • get_csv_row_writer – callable that returns a csv row writer function; customize this for non-default options: custom_writer = partial(get_csv_row_writer, delimiter='|'); write_as_csv(items, 'my_out_file', get_csv_row_writer=custom_writer)
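
The customization snippet from the get_csv_row_writer parameter description above, expanded into a self-contained form (the rows and file name are hypothetical):

from functools import partial

from karld.loadump import write_as_csv
from karld.unicode_io import get_csv_row_writer

# Write pipe-delimited rows instead of the default comma-delimited ones.
items = [(u'a', u'1'), (u'b', u'2')]
custom_writer = partial(get_csv_row_writer, delimiter='|')
write_as_csv(items, 'my_out_file', get_csv_row_writer=custom_writer)
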
karld.loadump.write_as_json(items, file_name, buffering=10485760)[source]

Writes each item of the items iterable to a line of the file as json.

Parameters:
  • items – A sequence of json dumpable objects.
  • file_name – the path of the output file.
  • buffering (int) – number of bytes to buffer files
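
A usage sketch; the records and file name are hypothetical:

from karld.loadump import write_as_json

# Each record becomes one json document on its own line.
records = [{'id': 1, 'name': u'one'}, {'id': 2, 'name': u'two'}]
write_as_json(records, 'my_out_file.json')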

merger Module

karld.merger.get_first_if_any(values)[source]
karld.merger.get_first_type_instance_of_group(instance_type, group)[source]
karld.merger.i_get_multi_groups(iterables, key=None)[source]
karld.merger.i_merge_group_sorted(iterables, key=None)[source]
karld.merger.merge(*iterables, **kwargs)[source]

Merge multiple sorted inputs into a single sorted output.

Similar to sorted(itertools.chain(*iterables)) but returns a generator, does not pull the data into memory all at once, and assumes that each of the input streams is already sorted (smallest to largest).

>>> list(merge([[2,1],[2,3],[2,5],[2,7]],
...            [[2,0],[2,2],[2,4],[2,8]],
...            [[2,5],[2,10],[2,15],[2,20]],
...            [], [[2,25]], key=itemgetter(-1)))
[[2, 0], [2, 1], [2, 2], [2, 3], [2, 4], [2, 5], [2, 5], [2, 7], [2, 8], [2, 10], [2, 15], [2, 20], [2, 25]]
karld.merger.sort_iterables(iterables, key=None)[source]
karld.merger.sort_merge_group(iterables, key=None)[source]
karld.merger.sorted_by(key, items)[source]

run_together Module

karld.run_together.csv_file_consumer(csv_rows_consumer, file_path_name)[source]

Consume the file at file_path_name as a csv file, passing it through csv_rows_consumer.

Parameters:
  • csv_rows_consumer (callable) – consumes iterables of rows, yielding a collection for each
  • file_path_name (str, str) – tuple of path and name of the input csv file
karld.run_together.csv_file_to_file(csv_rows_consumer, out_prefix, out_dir, file_path_name)[source]

Consume the file at file_path_name as a csv file, passing it through csv_rows_consumer, writing the results as a csv file into out_dir under the same name, lowered and prefixed.

Parameters:
  • csv_rows_consumer (callable) – consumes iterables of rows, yielding a collection for each
  • out_prefix (str) – prefix out_file_name
  • out_dir (str) – directory to write output file to
  • file_path_name (str, str) – tuple of path and name of the input csv file
karld.run_together.csv_files_to_file(csv_rows_consumer, out_prefix, out_dir, out_file_name, file_path_names)[source]

Consume the files at file_path_names as csv files, passing each through csv_rows_consumer, writing the results as a csv file into out_dir, named with out_file_name, lowered and prefixed.

Parameters:
  • csv_rows_consumer – consumes iterables of rows, yielding a collection for each
  • out_prefix (str) – prefix out_file_name
  • out_dir (str) – Directory to write output file to.
  • out_file_name (str) – Output file base name.
  • file_path_names (str, str) – tuples of path and basename for each input csv file
karld.run_together.distribute_multi_run_to_runners(items_func, in_dir, reader=None, walker=None, batch_size=1100, filter_func=None)[source]

With a multi-process pool, map batches of items from multiple files to an items processing function.

The reader callable should be as fast as possible to reduce data feeder cpu usage. It should do the minimum needed to produce discrete units of data, saving any decoding for the items function.

Parameters:
  • items_func – Callable that takes multiple items of the data.
  • in_dir – URL of the directory of input files.
  • reader – URL reader callable.
  • walker – A generator that takes the in_dir URL and emits url, name tuples.
  • batch_size – size of batches.
  • filter_func – a function that returns True for desired path and name tuples.
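
A usage sketch built from the documented parameters; the items function and directory are hypothetical, and the __main__ guard is there because a multi-process pool is involved:

from karld.run_together import distribute_multi_run_to_runners

def count_items(items):
    # Hypothetical items_func: runs in a worker process on each
    # batch of items read from the input files.
    return len(items)

if __name__ == '__main__':
    distribute_multi_run_to_runners(count_items, 'data/in', batch_size=500)
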
karld.run_together.distribute_run_to_runners(items_func, in_url, reader=None, batch_size=1100)[source]

With a multi-process pool, map batches of items from a file to an items processing function.

The reader callable should be as fast as possible to reduce data feeder cpu usage. It should do the minimum needed to produce discrete units of data, saving any decoding for the items function.

Parameters:
  • items_func – Callable that takes multiple items of the data.
  • reader – URL reader callable.
  • in_url – URL of the content.
  • batch_size – size of batches.
karld.run_together.multi_in_single_out(rows_reader, rows_writer, rows_iter_consumer, out_url, in_urls_func)[source]

Multi input combiner.

Parameters:
  • rows_reader – function that reads a file path and returns an iterator
  • rows_writer – function to write values
  • rows_iter_consumer – function that takes an iterable of iterators and returns an iterator
  • out_url – url for the rows_writer to write to.
  • in_urls_func – function that generates an iterator of input urls.
karld.run_together.pool_run_files_to_files(file_to_file, in_dir, filter_func=None)[source]

With a multi-process pool, map files in in_dir over file_to_file function.

Parameters:
  • file_to_file – callable that takes file paths.
  • in_dir – path to process all files from.
  • filter_func – Takes a tuple of path and base name of a file and returns a bool.
Returns:

A list of return values from the map.
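
A usage sketch combining this with csv_file_to_file from above; the rows consumer, directories, and csv filter are illustrative assumptions:

from functools import partial

from karld.run_together import csv_file_to_file, pool_run_files_to_files

def upper_rows(rows):
    # Hypothetical csv_rows_consumer: upper-case every value.
    return [[value.upper() for value in row] for row in rows]

file_to_file = partial(csv_file_to_file, upper_rows, 'out_', 'data/out')

if __name__ == '__main__':
    # filter_func receives a (path, base name) tuple per the docs.
    results = pool_run_files_to_files(
        file_to_file, 'data/in',
        filter_func=lambda path_name: path_name[1].endswith('.csv'))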

karld.run_together.serial_run_files_to_files(file_to_file, in_dir, filter_func=None)[source]

Serially map files in in_dir over the file_to_file function.

Use this to debug your file_to_file function more easily.

Parameters:
  • file_to_file – callable that takes file paths.
  • in_dir – path to process all files from.
  • filter_func – Takes a tuple of path and base name of a file and returns a bool.
Returns:

A list of return values from the map.

unicode_io Module

How To Encoding

If you’ve tried something like unicode('က') or u'hello ' + 'wကrld' or str(u'wörld'), you will have seen UnicodeDecodeError and UnicodeEncodeError. Likely, you’ve tried to read csv data from a file and mixed the data with unicode, and everything went fine until it got to a line with an accented character, where it broke with UnicodeDecodeError: 'ascii' codec can't decode byte ... What do you do? You’ve tried to write sequences of unicode strings to a csv file and gotten UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not in range(128). What do you do?

Unicode handles characters used by different languages around the world, emojis, curly quotes and other glyphs. The textual data in different parts of the world can have various encodings designed specifically to handle their glyphs; unicode can represent them all, but the data must be decoded from that encoding to unicode.

The data was written to the file in a specific encoding, either deliberately or because that was the default for the software. Unfortunately, it’s up to the reader of the data to know what the data was encoded in. It can be connected to the language or locale it was created in. Sometimes it can be inferred by the data. Many times it’s written in utf-8, which can handle encoding all the different chars that can be in a unicode string. It does this by saving chars like '¥', or in unicode, u'\xa5', as '\xc2\xa5'. u'\xa5'.encode('utf-8') results in '\xc2\xa5'. It uses more space, but can do it. By the way, '¥' is possible in this code because the encoding is declared at the top of this file.
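
Restating the byte-level detail above as runnable Python 2 code; nothing here goes beyond what the paragraph already says:

# -*- coding: utf-8 -*-
yen_bytes = '¥'                  # a utf-8 encoded str: '\xc2\xa5'
yen = yen_bytes.decode('utf-8')  # the unicode char u'\xa5'

assert yen.encode('utf-8') == '\xc2\xa5' == yen_bytes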

String transformation methods, such as upper() or lower(), don’t work on chars like 'î' or 'ê' if they are encoded as a utf-8 string, but will work if they are decoded from utf-8 to unicode.

>>> print 'î'.upper()
î
>>> print u'î'.upper()
Î
>>> print 'ê'.upper()
ê
>>> print 'ê'.decode('utf-8').upper()
Ê

The python 2.7 csv module doesn’t work with unicode, so the text it parses must be encoded from unicode to a str using an encoding that will handle all the chars in the text. utf-8 is a good choice, and thus is the default.

The purpose of this module is to facilitate reading and writing csv data in whatever encoding your data is in.

karld.unicode_io.csv_reader(csv_data, dialect=<class csv.excel>, encoding='utf-8', **kwargs)[source]

Csv row generator that decodes rows to unicode from csv data in a given encoding.

Utf-8 data in, unicode out. You may specify a different encoding of the incoming data.

Parameters:
  • csv_data – An iterable of str of the specified encoding.
  • dialect – csv dialect
  • encoding – The encoding of the given data.
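
A usage sketch; the file name is hypothetical, and the file is opened in binary mode because the reader expects encoded str data:

from karld.unicode_io import csv_reader

# Yields each row as a sequence of unicode values, decoded
# from the file's utf-8 data.
with open('myfile.csv', 'rb') as myfile:
    for row in csv_reader(myfile, encoding='utf-8'):
        print row
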
karld.unicode_io.get_csv_row_writer(stream, dialect=<class csv.excel>, encoding='utf-8', **kwargs)[source]

Create a csv, encoding from unicode, row writer.

Use the returned callable to write rows of unicode data to a stream, such as a file opened in write mode, in utf-8 (or another) encoding.

my_row_data = [
    [u'one', u'two'],
    [u'three', u'four'],
]

with open('myfile.csv', 'wt') as myfile:
    csv_row_writer = get_csv_row_writer(myfile)
    for row in my_row_data:
        csv_row_writer(row)
karld.unicode_io.csv_unicode_reader(unicode_csv_data, dialect=<class csv.excel>, **kwargs)[source]

Generator that reads serialized unicode csv data. Use this if you have a stream of csv data in unicode and you want to access the rows as sequences of unicode values.

Unicode in, unicode out.

Parameters:
  • unicode_csv_data – An iterable of unicode strings.
  • dialect – csv dialect
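
A usage sketch with an in-memory unicode buffer standing in for a real stream of unicode csv data:

import io

from karld.unicode_io import csv_unicode_reader

unicode_csv_data = io.StringIO(u'ex,ample\nw\xf6rld,data\n')
for row in csv_unicode_reader(unicode_csv_data):
    print row  # each row is a sequence of unicode values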

path Module

karld.path.i_walk_csv_paths(input_dir)[source]

Generator to yield the paths of csv files in the input directory.

Parameters:input_dir – path to the input directory
karld.path.i_walk_json_paths(input_dir)[source]

Generator to yield the paths of json files in the input directory.

Parameters:input_dir – path to the input directory

io Module

tap Module