example scripts
split_multiline Module
Run this script first to split the example data, which has multiple lines in some fields.
clean Module
Run this script to ‘clean’ the split-up data.
clean.contrived_cleaner(data_items)
Sort the data by the second row, enumerate it, apply title case to every field, and include the original index and the sorted index in the row.
Parameters: data_items – A sequence of unicode strings
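The steps described for contrived_cleaner can be sketched roughly as follows. This is a hypothetical stand-in, not the module's code: the name `contrived_cleaner_sketch` is invented, each data item is assumed to be a sequence of string fields, and the sort is assumed to use the second field of each item.

```python
def contrived_cleaner_sketch(rows):
    """Hypothetical illustration only; not the actual clean.contrived_cleaner."""
    # Pair every row with its original position before sorting.
    indexed = list(enumerate(rows))
    # Assumption: the sort key is the second field of each row.
    indexed.sort(key=lambda pair: pair[1][1])
    for sorted_index, (original_index, row) in enumerate(indexed):
        # Prepend both indexes and title-case every field.
        yield (original_index, sorted_index) + tuple(field.title() for field in row)


rows = [("pear", "fruit"), ("iron", "metal"), ("kiwi", "animal")]
cleaned = list(contrived_cleaner_sketch(rows))
```

Each output row starts with the original index, then the sorted index, then the title-cased fields.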
split_non_multiline Module
Run this script first to split the example data that does not have any multi-line fields.
shard_to_json Module
Shard out data to files as rows of JSON.
From a source of data, shard it to csv files.
consume_csv_file Module
Iteratively consume csv file.
consume_many_csv_files Module
Consume the items of a directory of csv files as if they were one file.
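One way to treat a directory of csv files as a single stream is a generator that walks the files in order and yields their rows one at a time. The sketch below uses only the standard library and assumes Python 3. The helper name `iter_csv_rows` and the demo file names are invented for illustration and are not taken from the script.

```python
import csv
import glob
import os
import tempfile


def iter_csv_rows(directory):
    """Yield rows from every .csv file in directory, as if they were one file."""
    for path in sorted(glob.glob(os.path.join(directory, "*.csv"))):
        with open(path, newline="") as handle:
            for row in csv.reader(handle):
                yield row


# Demo with two throwaway files (hypothetical names and contents).
with tempfile.TemporaryDirectory() as demo_dir:
    for name, text in (("a.csv", "1,apple\n"), ("b.csv", "2,iron\n")):
        with open(os.path.join(demo_dir, name), "w", newline="") as handle:
            handle.write(text)
    all_rows = list(iter_csv_rows(demo_dir))
```

Because rows are yielded lazily, only one file is open and one row is held at a time, however many files the directory contains.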
concat_csv_files Module
Concatenate all the csv files in a directory together.
merge_small_csv_files Module
- Merge a number of homogeneous small csv files on a key.
- Small means they all together fit in your computer’s memory.
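The description does not say how rows that share a key are combined, so the following is only one possible reading: load everything into a dictionary keyed on the chosen column and let later rows update earlier ones. The function name `merge_on_key`, the column names, and the sample data are all assumptions made for this sketch.

```python
import csv
import io


def merge_on_key(sources, key):
    """Merge rows from several csv sources into one dict, keyed on one column.

    Everything is held in memory, which is why the inputs must be small.
    """
    merged = {}
    for source in sources:
        for row in csv.DictReader(source):
            # Later rows with the same key overwrite earlier field values.
            merged.setdefault(row[key], {}).update(row)
    return merged


# In-memory stand-ins for two small files with the same columns.
first = io.StringIO("id,thing,kind\n1,apple,fruit\n2,iron,metal\n")
second = io.StringIO("id,thing,kind\n3,pear,fruit\n1,apple,pome\n")
merged = merge_on_key([first, second], key="id")
```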
tap_example Module
Uses tap to get information from a stream of data in csv files.
stream_searcher Module
Uses tap to get information from a stream of data in csv files in a designated directory, with optional multi-processing.
stream_searcher.certain_kind_tap(data_items)
As the stream of data items goes by, get different kinds of information from them: in this case, the things that are fruit and the things that are metal, collecting each kind with a different spigot.
stream_tap doesn’t consume the data_items iterator by itself; it is a generator and must be consumed by something else. In this case, the items are consumed by casting the iterator to a tuple, in batches.
Since a batch is not referenced by anything once it has been processed, the garbage collector can free its memory, so only a little memory is needed no matter how large data_items is. The only things retained are the results, which should be just a subset of the items; in this case, each getter function returns only a portion of each item it matches.
Parameters: data_items – A sequence of unicode strings
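The batch-wise consumption described above can be illustrated with a small standard-library sketch. It does not use the real stream_tap or spigot API: `tap_sketch`, `consume_in_batches`, the two getters, and the (thing, kind) item layout are all hypothetical stand-ins chosen for this example.

```python
from itertools import islice


def sketch_get_fruit(item):
    # Assumption: each item is a (thing, kind) pair.
    thing, kind = item
    return thing if kind == "fruit" else None


def sketch_get_metal(item):
    thing, kind = item
    return thing if kind == "metal" else None


def tap_sketch(items, getters, results):
    """Pass every item through unchanged, collecting getter matches on the side."""
    for item in items:
        for name, getter in getters.items():
            found = getter(item)
            if found is not None:
                results[name].append(found)
        yield item


def consume_in_batches(iterator, batch_size):
    """Drive a generator by building one tuple per batch and discarding it."""
    while True:
        batch = tuple(islice(iterator, batch_size))
        if not batch:
            break
        # Nothing keeps a reference to batch, so it can be garbage collected.


getters = {"fruit": sketch_get_fruit, "metal": sketch_get_metal}
results = {"fruit": [], "metal": []}
items = iter([("apple", "fruit"), ("iron", "metal"), ("oak", "tree"), ("pear", "fruit")])
consume_in_batches(tap_sketch(items, getters, results), batch_size=2)
```

Only `results` grows with the input, and it holds just the matched portion of each item rather than the items themselves.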
stream_searcher.get_fruit(item)
Get things that are fruit.
Returns: thing of item if it’s a fruit
stream_searcher.get_metal(item)
Get things that are metal.
Returns: thing of item if it’s metal
stream_searcher.main(*args)
Try it:
python stream_searcher.py
or:
python stream_searcher.py --pool True
or:
python stream_searcher.py --in-dir test_data/things_kinds
or:
python stream_searcher.py --pool True --in-dir test_data/things_kinds
stream_searcher.run(in_dir, pool)
Run the composition of csv_file_consumer and the information tap on the csv files in the input directory, collect the results from each file, merge them together, and print both kinds of results.