example scripts

split_multiline Module

Run this script first to split the example data, which has multiple lines in some fields.

split_multiline.main()[source]
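
The split itself isn't reproduced in this documentation, but the key point is that data with multi-line fields has to be split with a csv parser rather than by raw line count, so that quoted embedded newlines stay inside their fields. A minimal sketch of that idea (the function name, input file name, chunk size, and output naming are assumptions, not this script's actual interface):

# Hedged sketch: split a csv whose fields may contain embedded newlines
# into fixed-size chunk files, letting the csv module handle the quoting.
import csv
import os
from itertools import islice


def split_csv_with_multiline_fields(in_path, out_dir, rows_per_file=100):
    os.makedirs(out_dir, exist_ok=True)
    with open(in_path, newline="") as in_file:
        reader = csv.reader(in_file)
        chunks = iter(lambda: list(islice(reader, rows_per_file)), [])
        for index, chunk in enumerate(chunks):
            out_path = os.path.join(out_dir, "part_{0}.csv".format(index))
            with open(out_path, "w", newline="") as out_file:
                csv.writer(out_file).writerows(chunk)


if __name__ == "__main__":
    # split_data_ml/data is the directory clean.py reads from below.
    split_csv_with_multiline_fields("example_data.csv", "split_data_ml/data")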

clean Module

Run this script to ‘clean’ the split-up data.

clean.contrived_cleaner(data_items)[source]

Sort the data rows by their second field, enumerate them, apply title case to every field, and include both the original index and the sorted index in the row.

Parameters: data_items – A sequence of unicode strings
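
A minimal sketch of that transformation, assuming each data item is a comma-separated unicode string; the real cleaner's row representation and ordering details may differ:

# Hedged sketch of the cleaning step described above.
def contrived_cleaner_sketch(data_items):
    """Yield title-cased rows tagged with their original and sorted index."""
    # Remember each row's position in the incoming sequence.
    indexed = [(original_index, line.split(","))
               for original_index, line in enumerate(data_items)]
    # Order the rows by their second field.
    ordered = sorted(indexed, key=lambda pair: pair[1][1])
    # Enumerate the sorted rows, title-case every field,
    # and keep both indices in the output row.
    for sorted_index, (original_index, fields) in enumerate(ordered):
        yield ([str(original_index), str(sorted_index)]
               + [field.title() for field in fields])


if __name__ == "__main__":
    sample = [u"banana,yellow", u"apple,red", u"blueberry,blue"]
    for row in contrived_cleaner_sketch(sample):
        print(row)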
clean.main(*args)[source]

Try it:

python clean.py

or:

python clean.py --pool True

or:

python clean.py --in-dir split_data_ml/data --out-dir my_clean_data

or:

python clean.py --pool True --in-dir split_data_ml/data
clean.run(in_dir, out_dir, pool)[source]

split_non_multiline Module

Run this script first to split the example data that does not have any multi-line fields.

split_non_multiline.main()[source]
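
Because no field spans multiple lines in this data, it can be split by raw line count, with no csv parsing required. A rough sketch under that assumption (the names, input file, and chunk size are illustrative):

# Hedged sketch: split a one-record-per-line file into fixed-size chunks.
# Safe only because no field contains an embedded newline.
import os
from itertools import islice


def split_by_line_count(in_path, out_dir, lines_per_file=100):
    os.makedirs(out_dir, exist_ok=True)
    with open(in_path) as in_file:
        chunks = iter(lambda: list(islice(in_file, lines_per_file)), [])
        for index, chunk in enumerate(chunks):
            out_path = os.path.join(out_dir, "part_{0}.csv".format(index))
            with open(out_path, "w") as out_file:
                out_file.writelines(chunk)


if __name__ == "__main__":
    split_by_line_count("example_data.csv", "split_data/data")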

shard_data Module

Shard out data to files.

shard_data.main()[source]

Python 2 version
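
The entry above only names the behavior. As a generic illustration of sharding (written in Python 3 here, with hypothetical names and sizes), the pattern is to take batches from an iterable of rows and write each batch to its own numbered file:

# Hedged sketch: batch an iterable of text rows and write each batch
# to its own numbered output file.
import os
from itertools import islice


def shard_rows(rows, out_dir, rows_per_shard=1000):
    os.makedirs(out_dir, exist_ok=True)
    rows = iter(rows)
    batches = iter(lambda: list(islice(rows, rows_per_shard)), [])
    for shard_index, batch in enumerate(batches):
        out_path = os.path.join(out_dir, "shard_{0}.txt".format(shard_index))
        with open(out_path, "w") as out_file:
            out_file.writelines(line + "\n" for line in batch)


if __name__ == "__main__":
    shard_rows(("row {0}".format(n) for n in range(2500)), "shard_out")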

shard_to_csv Module

Shard out data to csv files.

shard_to_csv.main()[source]

From a source of data, shard it to csv files.
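
A sketch of the same pattern with csv output, assuming the data source is an iterable of row tuples (the helper name, shard size, and paths are illustrative):

# Hedged sketch: write batches of row tuples to numbered csv files.
import csv
import os
from itertools import islice


def shard_to_csv_files(rows, out_dir, rows_per_shard=1000):
    os.makedirs(out_dir, exist_ok=True)
    rows = iter(rows)
    batches = iter(lambda: list(islice(rows, rows_per_shard)), [])
    for shard_index, batch in enumerate(batches):
        out_path = os.path.join(out_dir, "shard_{0}.csv".format(shard_index))
        with open(out_path, "w", newline="") as out_file:
            csv.writer(out_file).writerows(batch)


if __name__ == "__main__":
    shard_to_csv_files(((n, "item {0}".format(n)) for n in range(2500)), "csv_shards")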

shard_to_json Module

Shard out data to files as rows of JSON.

shard_to_json.main()[source]

From a source of data, shard it to files of JSON rows.
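
A sketch of the JSON variant, writing each shard as a file of newline-delimited JSON rows (names and sizes are assumptions):

# Hedged sketch: write batches of records to numbered files of
# newline-delimited JSON, one JSON document per line.
import json
import os
from itertools import islice


def shard_to_json_files(records, out_dir, records_per_shard=1000):
    os.makedirs(out_dir, exist_ok=True)
    records = iter(records)
    batches = iter(lambda: list(islice(records, records_per_shard)), [])
    for shard_index, batch in enumerate(batches):
        out_path = os.path.join(out_dir, "shard_{0}.json".format(shard_index))
        with open(out_path, "w") as out_file:
            out_file.writelines(json.dumps(record) + "\n" for record in batch)


if __name__ == "__main__":
    shard_to_json_files(({"id": n} for n in range(2500)), "json_shards")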

consume_csv_file Module

Iteratively consume a csv file.

consume_csv_file.main()[source]

Iterate over the rows of a csv file, extracting the data you desire.
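
A sketch of that kind of iterative consumption with the standard csv module; the column picked out, the file name, and the generator name are illustrative:

# Hedged sketch: lazily iterate over the rows of a csv file,
# yielding only the field of interest from each row.
import csv


def iter_second_column(csv_path):
    with open(csv_path, newline="") as csv_file:
        for row in csv.reader(csv_file):
            if len(row) > 1:
                yield row[1]


if __name__ == "__main__":
    for value in iter_second_column("example_data.csv"):
        print(value)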

consume_many_csv_files Module

Consume the items of a directory of csv files as if they were one file.

consume_many_csv_files.main()[source]

Consume many csv files as if they were one.
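
A sketch of treating every csv file in a directory as one continuous stream of rows (the function name and glob pattern are assumptions):

# Hedged sketch: yield the rows of every csv file in a directory
# one after another, as if they were a single file.
import csv
import glob
import os


def iter_rows_of_directory(in_dir):
    for csv_path in sorted(glob.glob(os.path.join(in_dir, "*.csv"))):
        with open(csv_path, newline="") as csv_file:
            for row in csv.reader(csv_file):
                yield row


if __name__ == "__main__":
    for row in iter_rows_of_directory("test_data/things_kinds"):
        print(row)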

concat_csv_files Module

Concatenate all the csv files in a directory together.

concat_csv_files.main()[source]

Concatenate csv files together in no particular order.
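
A sketch of the concatenation, assuming the files share a layout and there are no header rows to skip (names and paths are illustrative):

# Hedged sketch: append the rows of every csv file in a directory to a
# single output csv, in whatever order the filesystem lists them.
import csv
import glob
import os


def concat_csv_directory(in_dir, out_path):
    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for csv_path in glob.glob(os.path.join(in_dir, "*.csv")):
            with open(csv_path, newline="") as csv_file:
                writer.writerows(csv.reader(csv_file))


if __name__ == "__main__":
    concat_csv_directory("test_data/things_kinds", "combined.csv")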

merge_small_csv_files Module

Merge a number of homogeneous small csv files on a key.
Small means they all together fit in your computer’s memory.

merge_small_csv_files.main()[source]

Merge a number of homogeneous small csv files on a key. Small means they all together fit in your computer’s memory.
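
Because everything fits in memory, one way to merge on a key is to index every row by that key in a dict and append each file's other fields to the entry for that key. A sketch under the assumption that the key is the first column (not necessarily the script's actual strategy):

# Hedged sketch: merge homogeneous small csv files on a key column,
# holding all of the merged rows in memory.
import csv
import glob
import os
from collections import OrderedDict


def merge_csv_files_on_key(in_dir, key_index=0):
    merged = OrderedDict()
    for csv_path in sorted(glob.glob(os.path.join(in_dir, "*.csv"))):
        with open(csv_path, newline="") as csv_file:
            for row in csv.reader(csv_file):
                key = row[key_index]
                entry = merged.setdefault(key, [key])
                # Append this file's non-key fields after whatever is there.
                entry.extend(field for i, field in enumerate(row) if i != key_index)
    return merged


if __name__ == "__main__":
    for key, fields in merge_csv_files_on_key("small_csv_files").items():
        print(key, fields)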

tap_example Module

Uses tap to get information from a stream of data in csv files.

stream_searcher Module

Uses tap to get information from a stream of data in csv files in a designated directory, with optional multiprocessing.

stream_searcher.certain_kind_tap(data_items)[source]

As the stream of data items goes by, get different kinds of information from them (in this case, the things that are fruit and metal), collecting each kind with a different spigot.

stream_tap doesn’t consume the data_items iterator by itself; it’s a generator and must be consumed by something else. In this case, the items are consumed by casting the iterator to a tuple, but in batches.

Since each batch is not referenced by anything else, its memory can be freed by the garbage collector, so no matter the size of data_items, only a little memory is needed. The only things retained are the results, which should be just a subset of the items; in this case, the getter functions return only a portion of each item they match.

Parameters: data_items – A sequence of unicode strings
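
The tap machinery itself isn't reproduced in this documentation. The following self-contained sketch re-creates the pattern described above; the Spigot class, the stream_tap generator, the batch size, and the (kind, thing) row layout are stand-ins, not the real library's API, and simplified versions of the get_fruit and get_metal getters documented below are included only to make the sketch runnable:

# Hedged, self-contained re-creation of the tap pattern described above.
from itertools import islice


class Spigot(object):
    """Collect whatever its getter function returns non-None for."""

    def __init__(self, getter):
        self.getter = getter
        self.results = []

    def __call__(self, item):
        value = self.getter(item)
        if value is not None:
            self.results.append(value)


def stream_tap(spigots, data_items):
    """Yield every item unchanged, letting each spigot inspect it first."""
    for item in data_items:
        for spigot in spigots:
            spigot(item)
        yield item


def get_fruit(item):
    kind, thing = item
    return thing if kind == "fruit" else None


def get_metal(item):
    kind, thing = item
    return thing if kind == "metal" else None


def certain_kind_tap(data_items):
    fruit_spigot = Spigot(get_fruit)
    metal_spigot = Spigot(get_metal)
    items = stream_tap((fruit_spigot, metal_spigot), data_items)
    # Consume the generator in batches; once a batch's tuple goes out of
    # scope it can be garbage collected, so memory use stays small no
    # matter how long the stream is. Only the spigot results are kept.
    while tuple(islice(items, 100)):
        pass
    return fruit_spigot.results, metal_spigot.results


if __name__ == "__main__":
    data = [("fruit", "apple"), ("metal", "iron"), ("fruit", "pear"), ("plant", "fern")]
    print(certain_kind_tap(iter(data)))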
stream_searcher.get_fruit(item)[source]

Get things that are fruit.

Returns: thing of item if it’s a fruit
stream_searcher.get_metal(item)[source]

Get things that are metal.

Returns: thing of item if it’s metal
stream_searcher.main(*args)[source]

Try it:

python stream_searcher.py

or:

python stream_searcher.py --pool True

or:

python stream_searcher.py --in-dir test_data/things_kinds

or:

python stream_searcher.py --pool True --in-dir test_data/things_kinds
stream_searcher.run(in_dir, pool)[source]

Run the composition of csv_file_consumer and the information tap over the csv files in the input directory, collect the results from each file, merge them together, and print both kinds of results.
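
A sketch of such a composition, mapping a per-file worker over the directory's csv files with an optional multiprocessing pool and merging the per-file results at the end; the function names, the (kind, thing) row layout, and the directory are assumptions rather than the script's exact interface:

# Hedged sketch: search every csv file in a directory, optionally in a
# multiprocessing pool, then merge and print the per-file results.
import csv
import glob
import multiprocessing
import os


def search_one_file(csv_path):
    """Collect the fruit and the metal things from one csv of (kind, thing) rows."""
    fruit, metal = [], []
    with open(csv_path, newline="") as csv_file:
        for kind, thing in csv.reader(csv_file):
            if kind == "fruit":
                fruit.append(thing)
            elif kind == "metal":
                metal.append(thing)
    return fruit, metal


def run(in_dir, pool=False):
    csv_paths = sorted(glob.glob(os.path.join(in_dir, "*.csv")))
    if pool:
        with multiprocessing.Pool() as workers:
            per_file_results = workers.map(search_one_file, csv_paths)
    else:
        per_file_results = [search_one_file(path) for path in csv_paths]
    # Merge the per-file results into one collection of each kind.
    all_fruit = [thing for fruit, _ in per_file_results for thing in fruit]
    all_metal = [thing for _, metal in per_file_results for thing in metal]
    print(all_fruit)
    print(all_metal)


if __name__ == "__main__":
    run("test_data/things_kinds", pool=True)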

stream_searcher.run_distribute(in_path)[source]

Run the composition of csv_file_consumer and the information tap over the csv files in the input directory, collect the results from each file, merge them together, and print both kinds of results.

stream_searcher.run_distribute_multi(in_dir)[source]

Run the composition of csv_file_consumer and the information tap over the csv files in the input directory, collect the results from each file, merge them together, and print both kinds of results.