Hi,

I am working on a framework called Brewery. The goal is to provide an abstract interface for streaming data from heterogeneous sources into heterogeneous targets. More information with images:
http://databrewery.org/doc/streams.html

The point is to have objects similar to file streams, but streaming structured data in the form of records/rows instead of bytes.

STREAMS

Currently implemented sources/targets are:

* Relational database table through SQLAlchemy (source+target)
* CSV file (source+target)
* XLS file (source only)
* MongoDB (source+target)
* Google spreadsheet (source only)
* directory with YAML files - one file per record (source+target)

For each source there are three basic methods:

- fields - list of fields provided by the source (has to be set explicitly for sources with unknown fields)
- rows() - iterator over the data, each row represented by a list
- records() - iterator over the data, each record represented by a dict object

Optionally you can use read_fields(limit) to learn what fields are present in the data source (for example in MongoDB).

For each target:

- append() - append an object, either a dictionary or a list, to the target

With this simple interface you can easily create pipes between MongoDB and Postgres, import a directory of YAML files into MySQL, ...

DATA QUALITY

In addition to that, there is a simple data auditing tool for basic data quality audits. You can use StreamAuditor (a stream target) to collect information about the data and then generate a data quality report. Currently audited data properties are:

* record and value count (might differ in document-based DBs, same in relational)
* null count
* empty string count
* distinct value count
* distinct values
* storage types (only one for relational databases)
* ratios of measured properties, such as null/value count or null/record count

More probes to come (in a modular way).

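
To make the interface described above more concrete, here is a minimal sketch of a pipe from a source into a target and an auditing target. It does not use Brewery itself: CSVSource, ListTarget and AuditTarget are hypothetical stand-ins written only for this illustration, and only the names fields, rows(), records() and append() are taken from the description above. The real classes and their constructors may look different, so please check the API documentation.

```python
import csv
import io


class CSVSource:
    """Hypothetical stand-in for a stream source (not Brewery's class)."""

    def __init__(self, text):
        self._text = text
        # Assumption: the first CSV line holds the field names.
        self.fields = next(csv.reader(io.StringIO(text)))

    def rows(self):
        """Iterate over the data, each row represented by a list."""
        reader = csv.reader(io.StringIO(self._text))
        next(reader)  # skip the header line
        for row in reader:
            yield row

    def records(self):
        """Iterate over the data, each record represented by a dict."""
        for row in self.rows():
            yield dict(zip(self.fields, row))


class ListTarget:
    """Hypothetical stand-in for a stream target; keeps rows in memory."""

    def __init__(self, fields):
        self.fields = fields
        self.data = []

    def append(self, obj):
        """Accept either a dict (record) or a list (row)."""
        if isinstance(obj, dict):
            obj = [obj.get(name) for name in self.fields]
        self.data.append(list(obj))


class AuditTarget:
    """Hypothetical audit target counting a few properties per field."""

    def __init__(self, fields):
        self.fields = fields
        self.record_count = 0
        self.stats = {
            name: {"nulls": 0, "empty_strings": 0, "distinct": set()}
            for name in fields
        }

    def append(self, obj):
        if not isinstance(obj, dict):
            obj = dict(zip(self.fields, obj))
        self.record_count += 1
        for name in self.fields:
            value = obj.get(name)
            stat = self.stats[name]
            if value is None:
                stat["nulls"] += 1
                continue
            if value == "":
                stat["empty_strings"] += 1
            stat["distinct"].add(value)


# Made-up sample data, for illustration only.
SAMPLE = "id,name,city\n1,Anna,Bratislava\n2,,Bratislava\n3,Carl,\n"

source = CSVSource(SAMPLE)
target = ListTarget(source.fields)
audit = AuditTarget(source.fields)

# The "pipe": every record read from the source is appended to the targets.
for record in source.records():
    target.append(record)
    audit.append(record)

for name in audit.fields:
    stat = audit.stats[name]
    print(name, stat["nulls"], stat["empty_strings"], len(stat["distinct"]))
```

Because every target only needs append(), the same loop works unchanged whether the target stores the data or only measures it, which is the reason the auditor is modelled here as just another stream target.
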
API is documented here: http://databrewery.org/doc/api/index.html

Sources:

bitbucket: https://bitbucket.org/Stiivi/brewery (main - mercurial repository)
github: https://github.com/Stiivi/brewery/ (synchronized with main)

Example usage: Some source streams (XLS/CSV) are already being used for the data proxy in the CKAN project, which converts data from various resources into a common structured form:
http://blog.ckan.org/2011/01/11/raw-data-in-ckan-resources-and-data-proxy/

FUTURE

Plans for the future are:

* command-line tools for simple data streaming tasks: copy, quality audit
* data processing stream network with nodes for simple transformations, analysis and data mining
* modular data quality probes - injectable into the network

The Brewery project is at an early stage. I would like to have some feedback: what do you think about it? Do you have any suggestions or comments?

If anyone would like to try it and runs into any trouble, just drop me a line and I will help.

Regards,

Stefan Urbanek

--
Twitter: @Stiivi
