example scripts¶
split_multiline Module¶
Run this script first to split the example data, which has multiple lines in some fields.
clean Module¶
Run this script to ‘clean’ the split up data.
- clean.contrived_cleaner(data_items)[source]¶
Sort the data by the second field of each row, enumerate it, apply title case to every field, and include the original index and the sorted index in the row.
Parameters: data_items – A sequence of unicode strings
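The steps described for contrived_cleaner can be sketched with the standard library alone. This is a minimal illustration, not the module's actual source: the name contrived_cleaner_sketch is hypothetical, each data item is assumed to be a sequence of string fields, the sort key is assumed to be the second field, and the output layout (original index, sorted index, then the title-cased fields) is a guess.

```python
def contrived_cleaner_sketch(data_items):
    """Hypothetical stand-in for clean.contrived_cleaner (assumed behavior)."""
    # Pair each row with its original position before sorting.
    indexed = enumerate(data_items)
    # Assumption: the sort key is the second field of each row.
    by_second_field = sorted(indexed, key=lambda pair: pair[1][1])
    # Enumerate again to obtain each row's position after sorting.
    for sorted_index, (orig_index, row) in enumerate(by_second_field):
        yield [orig_index, sorted_index] + [field.title() for field in row]


rows = [["pear", "fruit"], ["iron", "alloy"]]
cleaned = list(contrived_cleaner_sketch(rows))
```

Enumerating before the sort is what preserves the original index; enumerating again afterwards gives the sorted index.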
split_non_multiline Module¶
Run this script first to split the example data that has no multi-line fields.
shard_to_json Module¶
Shard out data to files as rows of JSON.
From a source of data, shard it to csv files.
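Sharding to rows of JSON might look roughly like the sketch below. It uses only the standard library and is not the module's own code: the function name shard_to_json_lines, the shard file naming, and the shard_size parameter are all hypothetical.

```python
import json
import os
import tempfile
from itertools import islice


def shard_to_json_lines(items, out_dir, shard_size):
    """Write items to numbered files, one JSON document per line (hypothetical helper)."""
    iterator = iter(items)
    paths = []
    shard_number = 0
    while True:
        # Pull at most shard_size items without loading the whole source.
        batch = list(islice(iterator, shard_size))
        if not batch:
            break
        # Assumed naming scheme: shard_0000.json, shard_0001.json, ...
        path = os.path.join(out_dir, "shard_{:04d}.json".format(shard_number))
        with open(path, "w", encoding="utf-8") as handle:
            for row in batch:
                handle.write(json.dumps(row) + "\n")
        paths.append(path)
        shard_number += 1
    return paths


with tempfile.TemporaryDirectory() as out_dir:
    source = (["row", str(n)] for n in range(5))
    paths = shard_to_json_lines(source, out_dir, 2)
    shard_names = [os.path.basename(p) for p in paths]
    with open(paths[-1], encoding="utf-8") as handle:
        last_shard = [json.loads(line) for line in handle]
```

Because the source is consumed through islice, only one shard's worth of items is held in memory at a time.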
consume_csv_file Module¶
Iteratively consume a csv file.
consume_many_csv_files Module¶
Consume the items of a directory of csv files as if they were one file.
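One way to treat a directory of csv files as a single stream is to chain lazy per-file readers together. The sketch below uses only the standard library; iter_csv_rows and iter_directory_rows are hypothetical names, and reading the files in sorted filename order is an assumption.

```python
import csv
import glob
import os
import tempfile
from itertools import chain


def iter_csv_rows(path):
    """Yield the rows of one csv file lazily."""
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            yield row


def iter_directory_rows(directory):
    """Yield rows from every *.csv file in directory as one stream."""
    # Assumption: files are read in sorted filename order.
    paths = sorted(glob.glob(os.path.join(directory, "*.csv")))
    return chain.from_iterable(iter_csv_rows(path) for path in paths)


with tempfile.TemporaryDirectory() as directory:
    with open(os.path.join(directory, "a.csv"), "w", newline="") as handle:
        handle.write("apple,fruit\n")
    with open(os.path.join(directory, "b.csv"), "w", newline="") as handle:
        handle.write("iron,metal\ncopper,metal\n")
    # Consume the stream while the temporary files still exist.
    rows = list(iter_directory_rows(directory))
```

Each file is opened only when the stream reaches it, so the number of files does not affect how much is held in memory.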
concat_csv_files Module¶
Concatenate all the csv files in a directory together.
merge_small_csv_files Module¶
- Merge a number of homogeneous small csv files on a key.
- Small means they all together fit in your computer’s memory.
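A merge of small files on a key can be sketched by accumulating rows in a dictionary, which is why everything has to fit in memory. This is an illustration under assumptions, not the module's implementation: merge_rows_on_key is a hypothetical name, the key is assumed to be the first column, and the files are assumed to have no header row.

```python
import csv
import io


def merge_rows_on_key(sources, key_index=0):
    """Combine rows that share a key across several row sources (hypothetical helper)."""
    merged = {}
    for rows in sources:
        for row in rows:
            key = row[key_index]
            # Every field except the key is appended to the merged row.
            rest = row[:key_index] + row[key_index + 1:]
            merged.setdefault(key, [key]).extend(rest)
    # Merged rows come out in the order their keys were first seen.
    return list(merged.values())


# In-memory stand-ins for two small csv files.
file_a = io.StringIO("1,apple\n2,iron\n")
file_b = io.StringIO("2,metal\n1,fruit\n")
merged = merge_rows_on_key([csv.reader(file_a), csv.reader(file_b)])
```

The rows do not need to appear in the same order in each file, since the dictionary lookup matches them by key.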
tap_example Module¶
Uses tap to get information from a stream of data in csv files.
- tap_example.certain_kind_tap(data_items)[source]¶
Parameters: data_items – A sequence of unicode strings
stream_searcher Module¶
Uses tap to get information from a stream of data in csv files in a designated directory, with optional multi-processing.
- stream_searcher.certain_kind_tap(data_items)[source]¶
As the stream of data items goes by, get different kinds of information from it. In this case, collect the things that are fruit and the things that are metal, each kind with a different spigot.
stream_tap doesn’t consume the data_items iterator by itself. It is a generator, so something else must consume it. Here the items are consumed by casting the iterator to a tuple, one batch at a time.
Since nothing references a batch once it has been consumed, the garbage collector can free its memory, so only a little memory is needed no matter how large data_items is. The only things retained are the results, which should be just a subset of the items; in this case, the getter functions return only a portion of each item they match.
Parameters: data_items – A sequence of unicode strings
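The batching idea described above can be illustrated without the library. The sketch below is not the real stream_tap API: stream_tap_sketch, consume_in_batches, and the two getter functions are hypothetical stand-ins, and each item is assumed to be a (kind, thing) pair.

```python
from itertools import islice


def get_fruit_sketch(item):
    """Return the thing of an item if its kind is fruit (assumed item shape)."""
    kind, thing = item
    return thing if kind == "fruit" else None


def get_metal_sketch(item):
    """Return the thing of an item if its kind is metal (assumed item shape)."""
    kind, thing = item
    return thing if kind == "metal" else None


def stream_tap_sketch(getters, data_items, results):
    """Pass items through unchanged, collecting what each getter returns."""
    for item in data_items:
        for getter, collected in zip(getters, results):
            found = getter(item)
            if found is not None:
                collected.append(found)
        yield item


def consume_in_batches(iterator, batch_size):
    """Drive a generator by casting it to tuples, one batch at a time."""
    while True:
        batch = tuple(islice(iterator, batch_size))
        if not batch:
            break
        # The batch is rebound on the next pass, so at most one is alive.


items = [("fruit", "apple"), ("metal", "iron"),
         ("fruit", "pear"), ("stone", "granite")]
fruits, metals = [], []
tap = stream_tap_sketch((get_fruit_sketch, get_metal_sketch), items, (fruits, metals))
consume_in_batches(tap, 2)
```

Only the collected results survive the loop; the batches themselves become unreachable as soon as the next one is read.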
- stream_searcher.get_fruit(item)[source]¶
Get things that are fruit.
Returns: thing of item if it’s a fruit
- stream_searcher.get_metal(item)[source]¶
Get things that are metal.
Returns: thing of item if it’s metal