example scripts

split_multiline Module

Run this script first to split the example data, which has multiple lines in some fields.

split_multiline.main()[source]
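
The split itself isn't reproduced in this documentation, but the key point is that data with multi-line fields has to be split with a csv parser rather than by raw line count, so that quoted embedded newlines stay inside their fields. A minimal sketch of that idea (the function name, input file name, chunk size, and output naming are assumptions, not this script's actual interface):

# Hedged sketch: split a csv whose fields may contain embedded newlines
# into fixed-size chunk files, letting the csv module handle the quoting.
import csv
import os
from itertools import islice


def split_csv_with_multiline_fields(in_path, out_dir, rows_per_file=100):
    os.makedirs(out_dir, exist_ok=True)
    with open(in_path, newline="") as in_file:
        reader = csv.reader(in_file)
        chunks = iter(lambda: list(islice(reader, rows_per_file)), [])
        for index, chunk in enumerate(chunks):
            out_path = os.path.join(out_dir, "part_{0}.csv".format(index))
            with open(out_path, "w", newline="") as out_file:
                csv.writer(out_file).writerows(chunk)


if __name__ == "__main__":
    # split_data_ml/data is the directory clean.py reads from below.
    split_csv_with_multiline_fields("example_data.csv", "split_data_ml/data")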

clean Module

Run this script to ‘clean’ the split-up data.

clean.contrived_cleaner(data_items)[source]

Sort the data rows by their second field, enumerate them, apply title case to every field, and include both the original index and the sorted index in the row.

Parameters: data_items – A sequence of unicode strings
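
A minimal sketch of that transformation, assuming each data item is a comma-separated unicode string; the real cleaner's row representation and ordering details may differ:

# Hedged sketch of the cleaning step described above.
def contrived_cleaner_sketch(data_items):
    """Yield title-cased rows tagged with their original and sorted index."""
    # Remember each row's position in the incoming sequence.
    indexed = [(original_index, line.split(","))
               for original_index, line in enumerate(data_items)]
    # Order the rows by their second field.
    ordered = sorted(indexed, key=lambda pair: pair[1][1])
    # Enumerate the sorted rows, title-case every field,
    # and keep both indices in the output row.
    for sorted_index, (original_index, fields) in enumerate(ordered):
        yield ([str(original_index), str(sorted_index)]
               + [field.title() for field in fields])


if __name__ == "__main__":
    sample = [u"banana,yellow", u"apple,red", u"blueberry,blue"]
    for row in contrived_cleaner_sketch(sample):
        print(row)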
clean.main(*args)[source]

Try it:

python clean.py

or:

python clean.py --pool True

or:

python clean.py --in-dir split_data_ml/data --out-dir my_clean_data

or:

python clean.py --pool True --in-dir split_data_ml/data
clean.run(in_dir, out_dir, pool)[source]

split_non_multiline Module

Run this script first to split the example data that does not have any multi-line fields.

split_non_multiline.main()[source]
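
Because no field spans multiple lines in this data, it can be split by raw line count, with no csv parsing required. A rough sketch under that assumption (the names, input file, and chunk size are illustrative):

# Hedged sketch: split a one-record-per-line file into fixed-size chunks.
# Safe only because no field contains an embedded newline.
import os
from itertools import islice


def split_by_line_count(in_path, out_dir, lines_per_file=100):
    os.makedirs(out_dir, exist_ok=True)
    with open(in_path) as in_file:
        chunks = iter(lambda: list(islice(in_file, lines_per_file)), [])
        for index, chunk in enumerate(chunks):
            out_path = os.path.join(out_dir, "part_{0}.csv".format(index))
            with open(out_path, "w") as out_file:
                out_file.writelines(chunk)


if __name__ == "__main__":
    split_by_line_count("example_data.csv", "split_data/data")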

shard_data Module

Shard out data to files.

shard_data.main()[source]

Python 2 version
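
The entry above only names the behavior. As a generic illustration of sharding (written in Python 3 here, with hypothetical names and sizes), the pattern is to take batches from an iterable of rows and write each batch to its own numbered file:

# Hedged sketch: batch an iterable of text rows and write each batch
# to its own numbered output file.
import os
from itertools import islice


def shard_rows(rows, out_dir, rows_per_shard=1000):
    os.makedirs(out_dir, exist_ok=True)
    rows = iter(rows)
    batches = iter(lambda: list(islice(rows, rows_per_shard)), [])
    for shard_index, batch in enumerate(batches):
        out_path = os.path.join(out_dir, "shard_{0}.txt".format(shard_index))
        with open(out_path, "w") as out_file:
            out_file.writelines(line + "\n" for line in batch)


if __name__ == "__main__":
    shard_rows(("row {0}".format(n) for n in range(2500)), "shard_out")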

shard_to_csv Module

Shard out data to csv files.

shard_to_csv.main()[source]

From a source of data, shard it to csv files.
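
A sketch of the same pattern with csv output, assuming the data source is an iterable of row tuples (the helper name, shard size, and paths are illustrative):

# Hedged sketch: write batches of row tuples to numbered csv files.
import csv
import os
from itertools import islice


def shard_to_csv_files(rows, out_dir, rows_per_shard=1000):
    os.makedirs(out_dir, exist_ok=True)
    rows = iter(rows)
    batches = iter(lambda: list(islice(rows, rows_per_shard)), [])
    for shard_index, batch in enumerate(batches):
        out_path = os.path.join(out_dir, "shard_{0}.csv".format(shard_index))
        with open(out_path, "w", newline="") as out_file:
            csv.writer(out_file).writerows(batch)


if __name__ == "__main__":
    shard_to_csv_files(((n, "item {0}".format(n)) for n in range(2500)), "csv_shards")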

shard_to_json Module

Shard out data to files as rows of JSON.

shard_to_json.main()[source]

From a source of data, shard it to files of JSON rows.
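
A sketch of the JSON variant, writing each shard as a file of newline-delimited JSON rows (names and sizes are assumptions):

# Hedged sketch: write batches of records to numbered files of
# newline-delimited JSON, one JSON document per line.
import json
import os
from itertools import islice


def shard_to_json_files(records, out_dir, records_per_shard=1000):
    os.makedirs(out_dir, exist_ok=True)
    records = iter(records)
    batches = iter(lambda: list(islice(records, records_per_shard)), [])
    for shard_index, batch in enumerate(batches):
        out_path = os.path.join(out_dir, "shard_{0}.json".format(shard_index))
        with open(out_path, "w") as out_file:
            out_file.writelines(json.dumps(record) + "\n" for record in batch)


if __name__ == "__main__":
    shard_to_json_files(({"id": n} for n in range(2500)), "json_shards")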

consume_csv_file Module

Iteratively consume a csv file.

consume_csv_file.main()[source]

Iterate over the rows of a csv file, extracting the data you desire.
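
A sketch of that kind of iterative consumption with the standard csv module; the column picked out, the file name, and the generator name are illustrative:

# Hedged sketch: lazily iterate over the rows of a csv file,
# yielding only the field of interest from each row.
import csv


def iter_second_column(csv_path):
    with open(csv_path, newline="") as csv_file:
        for row in csv.reader(csv_file):
            if len(row) > 1:
                yield row[1]


if __name__ == "__main__":
    for value in iter_second_column("example_data.csv"):
        print(value)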

consume_many_csv_files Module

Consume the items of a directory of csv files as if they were one file.

consume_many_csv_files.main()[source]

Consume many csv files as if they were one.
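
A sketch of treating every csv file in a directory as one continuous stream of rows (the function name and glob pattern are assumptions):

# Hedged sketch: yield the rows of every csv file in a directory
# one after another, as if they were a single file.
import csv
import glob
import os


def iter_rows_of_directory(in_dir):
    for csv_path in sorted(glob.glob(os.path.join(in_dir, "*.csv"))):
        with open(csv_path, newline="") as csv_file:
            for row in csv.reader(csv_file):
                yield row


if __name__ == "__main__":
    for row in iter_rows_of_directory("test_data/things_kinds"):
        print(row)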

concat_csv_files Module

Concatenate all the csv files in a directory together.

concat_csv_files.main()[source]

Concatenate csv files together in no particular order.
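
A sketch of the concatenation, assuming the files share a layout and there are no header rows to skip (names and paths are illustrative):

# Hedged sketch: append the rows of every csv file in a directory to a
# single output csv, in whatever order the filesystem lists them.
import csv
import glob
import os


def concat_csv_directory(in_dir, out_path):
    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for csv_path in glob.glob(os.path.join(in_dir, "*.csv")):
            with open(csv_path, newline="") as csv_file:
                writer.writerows(csv.reader(csv_file))


if __name__ == "__main__":
    concat_csv_directory("test_data/things_kinds", "combined.csv")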

merge_small_csv_files Module

Merge a number of homogeneous small csv files on a key.
Small means they all together fit in your computer’s memory.

merge_small_csv_files.main()[source]

Merge a number of homogeneous small csv files on a key. Small means they all together fit in your computer’s memory.
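
Because everything fits in memory, one way to merge on a key is to index every row by that key in a dict and append each file's other fields to the entry for that key. A sketch under the assumption that the key is the first column (not necessarily the script's actual strategy):

# Hedged sketch: merge homogeneous small csv files on a key column,
# holding all of the merged rows in memory.
import csv
import glob
import os
from collections import OrderedDict


def merge_csv_files_on_key(in_dir, key_index=0):
    merged = OrderedDict()
    for csv_path in sorted(glob.glob(os.path.join(in_dir, "*.csv"))):
        with open(csv_path, newline="") as csv_file:
            for row in csv.reader(csv_file):
                key = row[key_index]
                entry = merged.setdefault(key, [key])
                # Append this file's non-key fields after whatever is there.
                entry.extend(field for i, field in enumerate(row) if i != key_index)
    return merged


if __name__ == "__main__":
    for key, fields in merge_csv_files_on_key("small_csv_files").items():
        print(key, fields)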

tap_example Module

Uses tap to get information from a stream of data in csv files.

stream_searcher Module

Uses tap to get information from a stream of data in csv files in a designated directory, with optional multiprocessing.

stream_searcher.certain_kind_tap(data_items)[source]

As the stream of data items goes by, get different kinds of information from them (in this case, the things that are fruit and metal), collecting each kind with a different spigot.

stream_tap doesn’t consume the data_items iterator by itself; it’s a generator and must be consumed by something else. In this case, the items are consumed by casting the iterator to a tuple, but in batches.

Since each batch is not referenced by anything else, its memory can be freed by the garbage collector, so no matter the size of data_items, only a little memory is needed. The only things retained are the results, which should be just a subset of the items; in this case, the getter functions return only a portion of each item they match.

Parameters: data_items – A sequence of unicode strings
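
The tap machinery itself isn't reproduced in this documentation. The following self-contained sketch re-creates the pattern described above; the Spigot class, the stream_tap generator, the batch size, and the (kind, thing) row layout are stand-ins, not the real library's API, and simplified versions of the get_fruit and get_metal getters documented below are included only to make the sketch runnable:

# Hedged, self-contained re-creation of the tap pattern described above.
from itertools import islice


class Spigot(object):
    """Collect whatever its getter function returns non-None for."""

    def __init__(self, getter):
        self.getter = getter
        self.results = []

    def __call__(self, item):
        value = self.getter(item)
        if value is not None:
            self.results.append(value)


def stream_tap(spigots, data_items):
    """Yield every item unchanged, letting each spigot inspect it first."""
    for item in data_items:
        for spigot in spigots:
            spigot(item)
        yield item


def get_fruit(item):
    kind, thing = item
    return thing if kind == "fruit" else None


def get_metal(item):
    kind, thing = item
    return thing if kind == "metal" else None


def certain_kind_tap(data_items):
    fruit_spigot = Spigot(get_fruit)
    metal_spigot = Spigot(get_metal)
    items = stream_tap((fruit_spigot, metal_spigot), data_items)
    # Consume the generator in batches; once a batch's tuple goes out of
    # scope it can be garbage collected, so memory use stays small no
    # matter how long the stream is. Only the spigot results are kept.
    while tuple(islice(items, 100)):
        pass
    return fruit_spigot.results, metal_spigot.results


if __name__ == "__main__":
    data = [("fruit", "apple"), ("metal", "iron"), ("fruit", "pear"), ("plant", "fern")]
    print(certain_kind_tap(iter(data)))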
stream_searcher.get_fruit(item)[source]

Get things that are fruit.

Returns: thing of item if it’s a fruit
stream_searcher.get_metal(item)[source]

Get things that are metal.

Returns: thing of item if it’s metal
stream_searcher.main(*args)[source]

Try it:

python stream_searcher.py

or:

python stream_searcher.py --pool True

or:

python stream_searcher.py --in-dir test_data/things_kinds

or:

python stream_searcher.py --pool True --in-dir test_data/things_kinds
stream_searcher.run(in_dir, pool)[source]

Run the composition of csv_file_consumer and the information tap over the csv files in the input directory, collect the results from each file, merge them together, and print both kinds of results.
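
A sketch of such a composition, mapping a per-file worker over the directory's csv files with an optional multiprocessing pool and merging the per-file results at the end; the function names, the (kind, thing) row layout, and the directory are assumptions rather than the script's exact interface:

# Hedged sketch: search every csv file in a directory, optionally in a
# multiprocessing pool, then merge and print the per-file results.
import csv
import glob
import multiprocessing
import os


def search_one_file(csv_path):
    """Collect the fruit and the metal things from one csv of (kind, thing) rows."""
    fruit, metal = [], []
    with open(csv_path, newline="") as csv_file:
        for kind, thing in csv.reader(csv_file):
            if kind == "fruit":
                fruit.append(thing)
            elif kind == "metal":
                metal.append(thing)
    return fruit, metal


def run(in_dir, pool=False):
    csv_paths = sorted(glob.glob(os.path.join(in_dir, "*.csv")))
    if pool:
        with multiprocessing.Pool() as workers:
            per_file_results = workers.map(search_one_file, csv_paths)
    else:
        per_file_results = [search_one_file(path) for path in csv_paths]
    # Merge the per-file results into one collection of each kind.
    all_fruit = [thing for fruit, _ in per_file_results for thing in fruit]
    all_metal = [thing for _, metal in per_file_results for thing in metal]
    print(all_fruit)
    print(all_metal)


if __name__ == "__main__":
    run("test_data/things_kinds", pool=True)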

stream_searcher.run_distribute(in_path)[source]

Run the composition of csv_file_consumer and the information tap over the csv files in the input directory, collect the results from each file, merge them together, and print both kinds of results.

stream_searcher.run_distribute_multi(in_dir)[source]

Run the composition of csv_file_consumer and the information tap over the csv files in the input directory, collect the results from each file, merge them together, and print both kinds of results.