karld Package¶
_meta
Module¶
conversion_operators
Module¶
-
karld.conversion_operators.
apply_conversion_map
(conversion_map, entity)[source]¶ Returns a tuple of conversions.
-
karld.conversion_operators.
apply_conversion_map_map
(conversion_map, entity)[source]¶ Returns an ordered dict of keys and converted values.
-
karld.conversion_operators.
get_number_as_int
(number)[source]¶ Returns the first number from a string.
-
karld.conversion_operators.
join_stripped_gotten_value
(sep, getters, data)[source]¶ Join the values gotten from data with each of the getters, coerced to str and stripped of whitespace padding, using the separator.
Parameters: - sep (str) – Separator of values.
- getters – Collection of callables that each take the data and return a value.
- data – argument for the getters
-
karld.conversion_operators.
join_stripped_values
(sep, collection_getter, data)[source]¶ Join the values of the collection gotten from data with collection_getter, coerced to str and stripped of whitespace padding, using the separator.
Parameters: - sep (str) – Separator of values.
- collection_getter – Callable that takes the data and returns a collection.
- data – argument for the collection_getter
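The join-and-strip behaviour described for join_stripped_gotten_value can be sketched in a few lines of standalone Python. This is not karld's source: the function name join_stripped_gotten_value_sketch and the sample record are assumptions made for illustration.

```python
from operator import itemgetter


def join_stripped_gotten_value_sketch(sep, getters, data):
    """Hypothetical stand-in, not karld's implementation.

    Apply each getter to data, coerce the result to str, strip
    surrounding whitespace, and join the pieces with sep.
    """
    return sep.join(str(getter(data)).strip() for getter in getters)


# Sample record invented for this example.
record = {"first": "  Ada ", "last": "Lovelace  ", "born": 1815}
getters = (itemgetter("first"), itemgetter("last"), itemgetter("born"))
full = join_stripped_gotten_value_sketch(" ", getters, record)
```

Passing getters as plain callables keeps the function independent of the shape of data: itemgetter works for dicts and sequences, and attrgetter or a lambda would work for objects.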
iter_utils
Module¶
-
karld.iter_utils.
i_batch
(max_size, iterable)[source]¶ Generator that iteratively batches items to a max size and consumes the items iterable as each batch is yielded.
Parameters: - max_size (int) – Max size of each batch.
- iterable (iter) – An iterable
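A minimal sketch of lazy batching, assuming each batch is yielded as a tuple (the return type of karld's i_batch is not stated here). The name i_batch_sketch is hypothetical.

```python
from itertools import islice


def i_batch_sketch(max_size, iterable):
    """Hypothetical stand-in: yield tuples of at most max_size items.

    The source iterable is consumed only as each batch is requested.
    """
    iterator = iter(iterable)
    while True:
        batch = tuple(islice(iterator, max_size))
        if not batch:
            return
        yield batch


batches = list(i_batch_sketch(3, range(8)))
```

Because the generator pulls from a single shared iterator, no more than max_size items are held at a time, which is the property that matters when the input is a large file.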
loadump
Module¶
-
karld.loadump.
dump_dicts_to_json_file
(file_name, dicts, buffering=10485760)[source]¶ Writes each dictionary in the dicts iterable to a line of the file as JSON.
NOTE: Deprecated. Replaced by write_as_json, to match the signature of write_to_csv.
Parameters: buffering (int) – number of bytes to buffer files
-
karld.loadump.
ensure_dir
(directory)[source]¶ If directory doesn’t exist, make it.
Parameters: directory (str) – path to directory
-
karld.loadump.
ensure_file_path_dir
(file_path)[source]¶ Ensure the parent directory of the file path.
Parameters: file_path (str) – Path to file.
-
karld.loadump.
file_path_and_name
(path, base_name)[source]¶ Join the path and base_name and return the joined path together with the base_name.
Parameters: - path (str) – Directory path
- base_name (str) – File name
Returns: tuple of file path and file name.
-
karld.loadump.
i_get_csv_data
(file_name, *args, **kwargs)[source]¶ A generator for reading a csv file.
-
karld.loadump.
i_get_json_data
(file_name, *args, **kwargs)[source]¶ A generator for reading a file with JSON documents delimited by newlines.
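The newline-delimited JSON layout that dump_dicts_to_json_file writes and i_get_json_data reads can be illustrated with the standard library alone. The helper names write_json_lines and i_read_json_lines are hypothetical, and the sketch is written for Python 3 rather than the Python 2 this page targets.

```python
import json
import os
import tempfile


def write_json_lines(file_name, dicts):
    """Hypothetical writer: one JSON document per line."""
    with open(file_name, "w", encoding="utf-8") as out:
        for item in dicts:
            out.write(json.dumps(item) + "\n")


def i_read_json_lines(file_name):
    """Hypothetical reader: lazily yield one decoded document per line."""
    with open(file_name, encoding="utf-8") as stream:
        for line in stream:
            if line.strip():  # skip blank lines
                yield json.loads(line)


rows = [{"id": 1, "name": "ant"}, {"id": 2, "name": "bee"}]
path = os.path.join(tempfile.mkdtemp(), "rows.json")
write_json_lines(path, rows)
loaded = list(i_read_json_lines(path))
```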
-
karld.loadump.
i_read_buffered_file
(file_name, buffering=10485760, binary=True, py3_csv_read=False, encoding='utf-8')[source]¶ Generator of lines of a file name, with buffering for speed.
-
karld.loadump.
i_walk_dir_for_filepaths_names
(root_dir)[source]¶ Walks a directory yielding the paths and names of files.
Parameters: root_dir (str) – path to a directory.
-
karld.loadump.
i_walk_dir_for_paths_names
(root_dir)[source]¶ Walks a directory yielding the directory of files and names of files.
Parameters: root_dir (str) – path to a directory.
-
karld.loadump.
is_file_csv
(file_path_name)[source]¶ Is the file a csv file? Identify by extension.
Parameters: file_path_name (str) –
-
karld.loadump.
is_file_json
(file_path_name)[source]¶ Is the file a json file? Identify by extension.
Parameters: file_path_name (str) –
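A rough sketch of how a directory walker and an extension check like the ones above might fit together, using only os.walk and os.path. The names i_walk_filepaths_names_sketch and is_file_csv_sketch are hypothetical, and the case-insensitive extension comparison is an assumption.

```python
import os
import tempfile


def i_walk_filepaths_names_sketch(root_dir):
    """Hypothetical stand-in: yield (file path, file name) pairs."""
    for dir_path, _dir_names, file_names in os.walk(root_dir):
        for file_name in file_names:
            yield os.path.join(dir_path, file_name), file_name


def is_file_csv_sketch(file_path_name):
    """Assumed check: compare the lowercased extension with '.csv'."""
    return os.path.splitext(file_path_name)[1].lower() == ".csv"


# Build a small throwaway tree to walk.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sub"))
for relative in ("a.csv", os.path.join("sub", "b.json")):
    with open(os.path.join(root, relative), "w") as handle:
        handle.write("")

csv_names = sorted(
    name
    for path, name in i_walk_filepaths_names_sketch(root)
    if is_file_csv_sketch(path)
)
```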
-
karld.loadump.
split_file
(file_path, out_dir=None, max_lines=200000, buffering=10485760, line_reader=<function raw_line_reader>, split_file_writer=<function split_file_output>, read_binary=True)[source]¶ Opens then shards the file.
Parameters: - file_path (str) – Path to the large input file.
- max_lines (int) – Max number of lines in each shard.
- out_dir (str) – Path of directory to put the shards.
- buffering (int) – number of bytes to buffer files
-
karld.loadump.
split_file_output
(name, data, out_dir, max_lines=1100, buffering=10485760)[source]¶ Split an iterable of lines into groups and write each group to a shard.
Parameters: - name (str) – Each shard will use this in its name.
- data (iter) – Iterable of data to write.
- out_dir (str) – Path to directory to write the shards.
- max_lines (int) – Max number of lines per shard.
- buffering (int) – number of bytes to buffer files
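Sharding an iterable of lines can be sketched as follows. The function split_lines_to_shards is hypothetical, and the shard naming scheme (an index prefixed to the name) is an assumption, not necessarily what karld produces.

```python
import os
import tempfile
from itertools import islice


def split_lines_to_shards(name, lines, out_dir, max_lines):
    """Hypothetical stand-in: write groups of lines to numbered shards."""
    os.makedirs(out_dir, exist_ok=True)
    iterator = iter(lines)
    shard_paths = []
    index = 0
    while True:
        group = list(islice(iterator, max_lines))
        if not group:
            break
        # Assumed naming scheme: "<index>_<name>".
        shard_path = os.path.join(out_dir, "{0}_{1}".format(index, name))
        with open(shard_path, "w", encoding="utf-8") as shard:
            shard.writelines(group)
        shard_paths.append(shard_path)
        index += 1
    return shard_paths


lines = ["line {0}\n".format(n) for n in range(5)]
shards = split_lines_to_shards("data.txt", lines, tempfile.mkdtemp(), max_lines=2)
```

Five input lines with max_lines=2 give three shards, the last one holding the single leftover line.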
-
karld.loadump.
split_file_output_csv
(filename, data, out_dir=None, max_lines=1100, buffering=10485760, write_as_csv=<function write_as_csv>)[source]¶ Split an iterable of csv serializable rows of data into groups and write each to a csv shard.
Parameters: buffering (int) – number of bytes to buffer files
-
karld.loadump.
split_file_output_json
(filename, dict_list, out_dir=None, max_lines=1100, buffering=10485760)[source]¶ Split an iterable of JSON serializable rows of data into groups and write each to a shard.
Parameters: buffering (int) – number of bytes to buffer files
-
karld.loadump.
write_as_csv
(items, file_name, append=False, line_buffer_size=None, buffering=10485760, get_csv_row_writer=<function get_csv_row_writer>)[source]¶ Writes out items to a csv file in groups.
Parameters: - items – An iterable collection of collections.
- file_name – path to the output file.
- append – whether to append or overwrite the file.
- line_buffer_size – number of lines to write at a time.
- buffering (int) – number of bytes to buffer files
- get_csv_row_writer – callable that returns a csv row writer function. Customize this for non-default options: custom_writer = partial(get_csv_row_writer, delimiter="|"); write_as_csv(items, 'my_out_file', get_csv_row_writer=custom_writer)
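The partial-based customization mentioned for get_csv_row_writer can be illustrated without karld. In this sketch, get_row_writer_sketch and write_rows_sketch are hypothetical stand-ins built on the standard csv module and Python 3 text streams; only the pattern of pre-binding csv options with functools.partial is taken from the description above.

```python
import csv
import io
from functools import partial


def get_row_writer_sketch(stream, **kwargs):
    """Hypothetical factory: return a function that writes one row."""
    return csv.writer(stream, **kwargs).writerow


def write_rows_sketch(items, stream, get_row_writer=get_row_writer_sketch):
    """Hypothetical writer that accepts a replaceable row-writer factory."""
    write_row = get_row_writer(stream)
    for item in items:
        write_row(item)


# Pre-bind a non-default csv option, as the parameter description suggests.
pipe_writer = partial(get_row_writer_sketch, delimiter="|")
buffer = io.StringIO()
write_rows_sketch([["a", "b"], ["c", "d"]], buffer, get_row_writer=pipe_writer)
output = buffer.getvalue()
```

Injecting the factory rather than a delimiter argument means any csv.writer option (quoting, dialect, lineterminator) can be changed without widening the writer's own signature.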
merger
Module¶
-
karld.merger.
merge
(*iterables, **kwargs)[source]¶ Merge multiple sorted inputs into a single sorted output.
Similar to sorted(itertools.chain(*iterables)) but returns a generator, does not pull the data into memory all at once, and assumes that each of the input streams is already sorted (smallest to largest).
>>> list(merge([[2,1],[2,3],[2,5],[2,7]],
...            [[2,0],[2,2],[2,4],[2,8]],
...            [[2,5],[2,10],[2,15],[2,20]],
...            [],
...            [[2,25]], key=itemgetter(-1)))
[[2, 0], [2, 1], [2, 2], [2, 3], [2, 4], [2, 5], [2, 5], [2, 7], [2, 8], [2, 10], [2, 15], [2, 20], [2, 25]]
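For comparison, the standard library's heapq.merge (Python 3.5+) performs the same kind of lazy merge of already-sorted inputs with a key function. The sketch below uses it instead of karld.merger.merge, so it illustrates the technique and says nothing about karld's internals.

```python
import heapq
from operator import itemgetter

# Each input is already sorted by its last element.
merged = list(
    heapq.merge(
        [[2, 1], [2, 3], [2, 5], [2, 7]],
        [[2, 0], [2, 2], [2, 4], [2, 8]],
        [[2, 5], [2, 10], [2, 15], [2, 20]],
        [],
        [[2, 25]],
        key=itemgetter(-1),
    )
)

# Project the sort key back out to see the merged order.
last_values = [row[-1] for row in merged]
```

The merge yields the original rows; the key only decides their order. Inputs that are not already sorted by that key produce an output that is not sorted either.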
run_together
Module¶
-
karld.run_together.
csv_file_consumer
(csv_rows_consumer, file_path_name)[source]¶ Consume the file at file_path_name as a csv file, passing it through csv_rows_consumer.
Parameters: - csv_rows_consumer (callable) – consumes data_items, yielding a collection for each
- file_path_name (str, str) – path to input csv file
-
karld.run_together.
csv_file_to_file
(csv_rows_consumer, out_prefix, out_dir, file_path_name)[source]¶ Consume the file at file_path_name as a csv file, passing it through csv_rows_consumer, writing the results as a csv file into out_dir as the same name, lowered, and prefixed.
Parameters: - csv_rows_consumer (callable) – consumes data_items, yielding a collection for each
- out_prefix (str) – prefix out_file_name
- out_dir (str) – directory to write output file to
- file_path_name (str, str) – path to input csv file
-
karld.run_together.
csv_files_to_file
(csv_rows_consumer, out_prefix, out_dir, out_file_name, file_path_names)[source]¶ Consume the files at file_path_names as csv files, passing them through csv_rows_consumer, and write the results as a csv file into out_dir, named out_file_name with out_prefix prepended.
Parameters: - csv_rows_consumer – consumes data_items, yielding a collection for each
- out_prefix (str) – prefix out_file_name
- out_dir (str) – Directory to write output file to.
- out_file_name (str) – Output file base name.
- file_path_names (str, str) – tuple of paths and basenames to input csv files
-
karld.run_together.
distribute_multi_run_to_runners
(items_func, in_dir, reader=None, walker=None, batch_size=1100, filter_func=None)[source]¶ With a multi-process pool, map batches of items from multiple files to an items processing function.
The reader callable should be as fast as possible to reduce data feeder cpu usage. It should do the minimum needed to produce discrete units of data; save any decoding for the items function.
Parameters: - items_func – Callable that takes multiple items of the data.
- reader – URL reader callable.
- walker – A generator that takes the in_dir URL and emits url, name tuples.
- batch_size – size of batches.
- filter_func – a function that returns True for desired paths names.
-
karld.run_together.
distribute_run_to_runners
(items_func, in_url, reader=None, batch_size=1100)[source]¶ With a multi-process pool, map batches of items from a file to an items processing function.
The reader callable should be as fast as possible to reduce data feeder cpu usage. It should do the minimum needed to produce discrete units of data; save any decoding for the items function.
Parameters: - items_func – Callable that takes multiple items of the data.
- reader – URL reader callable.
- in_url – Url of content
- batch_size – size of batches.
-
karld.run_together.
multi_in_single_out
(rows_reader, rows_writer, rows_iter_consumer, out_url, in_urls_func)[source]¶ Multi input combiner.
Parameters: - rows_reader – function that reads a file path and returns an iterator
- rows_writer – function to write values
- rows_iter_consumer – function that takes an iterator of iterators and returns an iterator.
- out_url – url for the rows_writer to write to.
- in_urls_func – function generates iterator of input urls.
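One way the five callables could fit together is sketched below. This is a guess at the wiring, not karld's code: the argument order of rows_writer (rows first, then url) and the zero-argument in_urls_func are assumptions, and the in-memory dicts stand in for real files.

```python
import heapq


def multi_in_single_out_sketch(rows_reader, rows_writer, rows_iter_consumer,
                               out_url, in_urls_func):
    """Hypothetical stand-in: read many inputs, combine, write once."""
    row_iterators = (rows_reader(in_url) for in_url in in_urls_func())
    # Assumed writer signature: rows_writer(rows, url).
    rows_writer(rows_iter_consumer(row_iterators), out_url)


# In-memory stand-ins for input and output files.
sources = {"a": [1, 4, 7], "b": [2, 5, 8]}
written = {}


def read_rows(url):
    return iter(sources[url])


def write_rows(rows, url):
    written[url] = list(rows)


def merge_sorted(iterators):
    # Combine the already-sorted inputs into one sorted stream.
    return heapq.merge(*iterators)


multi_in_single_out_sketch(
    read_rows, write_rows, merge_sorted, "out", lambda: ["a", "b"]
)
```

Keeping the reader, combiner, and writer as separate callables lets the same driver serve a sorted merge, a plain concatenation, or a join, by swapping only rows_iter_consumer.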
-
karld.run_together.
pool_run_files_to_files
(file_to_file, in_dir, filter_func=None)[source]¶ With a multi-process pool, map files in in_dir over the file_to_file function.
Parameters: - file_to_file – callable that takes file paths.
- in_dir – path to process all files from.
- filter_func – Takes a tuple of path and base name of a file and returns a bool.
Returns: A list of return values from the map.
-
karld.run_together.
serial_run_files_to_files
(file_to_file, in_dir, filter_func=None)[source]¶ With a plain serial map, map files in in_dir over the file_to_file function.
Use this to debug your file_to_file function more easily.
Parameters: - file_to_file – callable that takes file paths.
- in_dir – path to process all files from.
- filter_func – Takes a tuple of path and base name of a file and returns a bool.
Returns: A list of return values from the map.
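A serial version of this file-mapping pattern might look like the sketch below. The name serial_run_sketch is hypothetical, and it assumes that file_to_file receives the same (path, base name) tuple that filter_func does.

```python
import os
import tempfile


def serial_run_sketch(file_to_file, in_dir, filter_func=None):
    """Hypothetical stand-in: map file_to_file over files, one at a time."""
    pairs = []
    for dir_path, _dir_names, file_names in os.walk(in_dir):
        for base_name in sorted(file_names):
            pairs.append((os.path.join(dir_path, base_name), base_name))
    return [
        file_to_file(pair)
        for pair in pairs
        if filter_func is None or filter_func(pair)
    ]


# Throwaway input directory with one csv and one non-csv file.
root = tempfile.mkdtemp()
for base_name in ("a.csv", "b.txt"):
    with open(os.path.join(root, base_name), "w") as handle:
        handle.write("x\n")


def only_csv(file_path_name):
    return file_path_name[1].endswith(".csv")


def shout_name(file_path_name):
    _file_path, base_name = file_path_name
    return base_name.upper()


results = serial_run_sketch(shout_name, root, filter_func=only_csv)
```

Running serially keeps tracebacks and breakpoints in the calling process. Once file_to_file behaves, a multiprocessing.Pool map could take the place of the list comprehension, provided the callables are picklable.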
unicode_io
Module¶
How To Encoding¶
If you’ve tried something like unicode('က'),
u'hello ' + 'wကrld', or str(u'wörld'),
you will have seen UnicodeDecodeError
and UnicodeEncodeError. Likely, you’ve tried to
read csv data from a file and mixed the data with unicode,
and everything went fine until it reached a line with
a word containing an accented character, where it broke and showed
UnicodeDecodeError: 'ascii' codec can't decode byte ...
What do you do?
You’ve tried to write sequences of unicode strings
to a csv file and gotten
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not in range(128)
What do you do?
Unicode handles characters used by different languages around the world, emojis, curly quotes and other glyphs. The textual data in different parts of the world can have various encodings designed to specifically handle their glyphs and unicode can represent them all, but the data must be decoded from that encoding to unicode.
The data was written to the file in a specific encoding,
either deliberately or because that was the default for
the software. Unfortunately, it’s up to the reader of the
data to know what the data was encoded in. It can be
connected to the language or locale it was created in.
Sometimes it can be inferred from the data. Many times
it’s written in utf-8, which can encode all
the different chars that can be in a unicode string.
It does this by saving non-ASCII chars as more than one byte:
'¥', or in unicode, u'\xa5', is saved as the two bytes '\xc2\xa5'.
u'\xa5'.encode('utf-8') results in '\xc2\xa5'.
By the way, '¥'
is possible in this code because the encoding is declared
at the top of this file.
String transformation methods, such as upper() or lower(),
don’t work on chars like 'î' or 'ê' if they are
encoded as a utf-8 str, but will work if they are
decoded from utf-8 to unicode.
>>> print 'î'.upper()
î
>>> print u'î'.upper()
Î
>>> print 'ê'.upper()
ê
>>> print 'ê'.decode('utf-8').upper()
Ê
The python 2.7 csv module doesn’t work with unicode, so the text it parses must be encoded from unicode to a str using an encoding that will handle all the chars in the text. utf-8 is a good choice, and thus is the default.
The purpose of this module is to facilitate reading and writing csv data in whatever encoding your data is in.
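The examples above are Python 2. As a rough Python 3 counterpart, where str is already unicode and encoded data is bytes, the same encode and decode steps look like this. Escape sequences are used so the snippet does not depend on the source file's encoding.

```python
# Python 3 sketch: str is unicode text, bytes is encoded data.
yen = "\xa5"                    # the character '¥'
encoded = yen.encode("utf-8")   # two bytes in utf-8

raw = "\xee".encode("utf-8")    # 'î' encoded as utf-8 bytes
upper_bytes = raw.upper()       # bytes.upper() only changes ASCII letters
upper_text = raw.decode("utf-8").upper()  # decode first, then transform
```

The rule carries over unchanged: decode to text at the boundary, transform the text, and encode again only when writing out.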
-
karld.unicode_io.
csv_reader
(csv_data, dialect=<class csv.excel>, encoding='utf-8', **kwargs)[source]¶ Csv row generator that decodes csv data from a given encoding to unicode.
Utf-8 data in, unicode out. You may specify a different encoding of the incoming data.
Parameters: - csv_data – An iterable of str of the specified encoding.
- dialect – csv dialect
- encoding – The encoding of the given data.
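A Python 3 sketch of the "encoded data in, unicode out" idea, using the standard csv module. The name csv_reader_sketch is hypothetical. Under Python 2 the order is reversed (the csv module parses encoded str and the cells are decoded afterwards), so this shows the concept rather than karld's implementation.

```python
import csv


def csv_reader_sketch(csv_data, encoding="utf-8", **kwargs):
    """Hypothetical stand-in: decode byte lines, then parse them as csv."""
    decoded_lines = (line.decode(encoding) for line in csv_data)
    for row in csv.reader(decoded_lines, **kwargs):
        yield row


# Sample data encoded as latin-1 rather than the utf-8 default.
latin1_lines = [
    "name,city\r\n".encode("latin-1"),
    "J\xf6rg,K\xf6ln\r\n".encode("latin-1"),
]
rows = list(csv_reader_sketch(latin1_lines, encoding="latin-1"))
```

Decoding the same bytes with the utf-8 default would raise UnicodeDecodeError, which is why the encoding of the incoming data has to be known and passed in.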
-
karld.unicode_io.
get_csv_row_writer
(stream, dialect=<class csv.excel>, encoding='utf-8', **kwargs)[source]¶ Create a csv row writer that encodes from unicode.
Use the returned callable to write rows of unicode data to a stream, such as a file opened in write mode, in utf-8 (or another) encoding.
    my_row_data = [
        [u'one', u'two'],
        [u'three', u'four'],
    ]
    with open('myfile.csv', 'wt') as myfile:
        unicode_row_writer = get_csv_row_writer(myfile)
        for row in my_row_data:
            unicode_row_writer(row)
-
karld.unicode_io.
csv_unicode_reader
(unicode_csv_data, dialect=<class csv.excel>, **kwargs)[source]¶ Generator that reads serialized unicode csv data. Use this if you have a stream of data in unicode and you want to access the rows of the data as sequences of unicode strings.
Unicode in, unicode out.
Parameters: - unicode_csv_data – An iterable of unicode strings.
- dialect – csv dialect