[sqlalchemy] Brewery: Heterogenous data streams with SQL Alchemy

Stefan Urbanek Fri, 21 Jan 2011 03:15:30 -0800

Hi,

I am working on a framework called Brewery. The goal is to provide an
abstract interface for streaming data from heterogeneous sources into
heterogeneous targets. More information, with images:


http://databrewery.org/doc/streams.html

The point is to have objects similar to file streams, but streaming
structured data in the form of records/rows instead of bytes.

STREAMS

Currently implemented sources/targets are:

* Relational database table through SQLAlchemy (source+target)
* CSV file (source+target)
* XLS file (source only)
* MongoDB (source+target)
* Google spreadsheet (source only)
* directory with YAML files - one file per record (source+target)

For each source there are three basic methods:

- fields - list of fields provided by the source (has to be set
explicitly for sources whose fields are not known in advance)
- rows() - iterator over the data, each row represented as a list
- records() - iterator over the data, each record represented as a dict

Optionally, you can use read_fields(limit) to learn what fields are
present in a data source (for example in MongoDB).
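
To make the source interface more concrete, here is a minimal sketch in
plain Python. It is not Brewery's actual code: the class name
CSVSourceSketch and the standalone read_fields() helper are hypothetical,
and the way the field probing works (scanning the first `limit` dict
records for their keys) is only an assumption about what such a method
could do.

```python
import csv
import io


class CSVSourceSketch:
    """Hypothetical stand-in for a stream source; not Brewery's own class."""

    def __init__(self, handle):
        self._reader = csv.reader(handle)
        # Assumption: the first line is a header that names the fields.
        self.fields = next(self._reader)

    def rows(self):
        """Yield each remaining record as a list of values."""
        for row in self._reader:
            yield row

    def records(self):
        """Yield each remaining record as a dict keyed by field name."""
        for row in self._reader:
            yield dict(zip(self.fields, row))


def read_fields(records, limit=None):
    """Collect the field names seen in the first `limit` dict records.

    Hypothetical helper for sources without a fixed schema, such as a
    document store where each record may carry different keys.
    """
    fields = []
    for index, record in enumerate(records):
        if limit is not None and index >= limit:
            break
        for name in record:
            if name not in fields:
                fields.append(name)
    return fields


data = "name,city\nAnna,Bratislava\nJan,Brno\n"
source = CSVSourceSketch(io.StringIO(data))
print(source.fields)
print(list(source.records()))

documents = [{"name": "Anna"}, {"name": "Jan", "city": "Brno"}]
print(read_fields(documents, limit=1))
print(read_fields(documents))
```

Both iterators read from the same underlying handle, so in this sketch
a source can be consumed only once, either by rows() or by records().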

For each target:

- append() - appends an object, either a dictionary or a list, to the
target

With this simple interface you can easily create pipes between MongoDB
and Postgres, import a directory of YAML files into MySQL, ...
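
The idea of a pipe can be sketched as a loop that hands every record
from a source iterator to a target's append(). The names below
(ListTargetSketch, pipe) are hypothetical and only illustrate the
pattern; an in-memory list stands in for a real database table, and the
rule that missing dictionary keys become None is an assumption.

```python
class ListTargetSketch:
    """Hypothetical in-memory target; stands in for a table or a file."""

    def __init__(self, fields):
        self.fields = list(fields)
        self.stored = []

    def append(self, obj):
        """Accept either a dict (keyed by field name) or a list of values."""
        if isinstance(obj, dict):
            # Assumption: fields missing from the dict are stored as None.
            row = [obj.get(name) for name in self.fields]
        else:
            row = list(obj)
        self.stored.append(row)


def pipe(records, target):
    """Copy every record from an iterator into the target; return the count."""
    count = 0
    for record in records:
        target.append(record)
        count += 1
    return count


records = [{"name": "Anna", "city": "Bratislava"}, {"name": "Jan"}]
target = ListTargetSketch(["name", "city"])
copied = pipe(iter(records), target)
target.append(["Eva", "Praha"])
print(copied, target.stored)
```

Because the loop depends only on an iterator on one side and append()
on the other, any source can be combined with any target.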

DATA QUALITY

In addition, there is a simple data auditing tool for basic data
quality audits. You can use StreamAuditor (a stream target) to collect
information about the data and then generate a data quality report.
Currently audited data properties are:

* record and value count (might differ in document-based DBs, same in
relational ones)
* null count
* empty string count
* distinct value count
* distinct values
* storage types (only one for relational databases)
* ratios of measured properties, such as null/value count or
null/record count

More probes to come (in a modular way).
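
As an illustration of how such an audit can be collected while records
stream through a target, here is a small sketch. It is not the
StreamAuditor implementation: the names AuditorSketch and FieldStats
are hypothetical, values are assumed to be hashable, and null values
are left out of the distinct set by choice.

```python
from collections import defaultdict


class FieldStats:
    """Counters collected for one field."""

    def __init__(self):
        self.value_count = 0
        self.null_count = 0
        self.empty_string_count = 0
        self.distinct_values = set()
        self.storage_types = set()


class AuditorSketch:
    """Hypothetical auditing target: append() updates per-field statistics."""

    def __init__(self):
        self.record_count = 0
        self.stats = defaultdict(FieldStats)

    def append(self, record):
        self.record_count += 1
        for field, value in record.items():
            stat = self.stats[field]
            # A value is counted whenever the field is present in the record,
            # so value_count can be lower than record_count for documents.
            stat.value_count += 1
            stat.storage_types.add(type(value).__name__)
            if value is None:
                stat.null_count += 1
                continue
            if value == "":
                stat.empty_string_count += 1
            stat.distinct_values.add(value)

    def null_record_ratio(self, field):
        """Ratio of null values to the number of records seen so far."""
        if self.record_count == 0:
            return 0.0
        return self.stats[field].null_count / self.record_count


auditor = AuditorSketch()
for record in [{"name": "Anna", "age": 31},
               {"name": "", "age": None},
               {"name": "Anna"}]:
    auditor.append(record)

age = auditor.stats["age"]
print(auditor.record_count, age.value_count, age.null_count)
print(sorted(age.storage_types), auditor.null_record_ratio("age"))
```

The third record has no "age" key at all, which is the document-store
case where the value count and the record count differ.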

API is documented here:

http://databrewery.org/doc/api/index.html

Sources:

bitbucket: https://bitbucket.org/Stiivi/brewery (main - mercurial
repository)
github: https://github.com/Stiivi/brewery/ (synchronized with main)

Example usage: some source streams (XLS/CSV) are already being used by
the data proxy in the CKAN project to convert data from various
resources into a common structured form:

    http://blog.ckan.org/2011/01/11/raw-data-in-ckan-resources-and-data-proxy/

FUTURE

Plans for the future are:

* command-line tools for simple data streaming tasks: copy, quality
audit
* data processing stream network with nodes for simple
transformations, analysis and data mining
* modular data quality probes - injectable into the network

The Brewery project is at an early stage. I would like to have some
feedback: what do you think about it? Do you have any suggestions or
comments? If anyone would like to try it and runs into any trouble,
just drop me a line and I will help.

Regards,

Stefan Urbanek
--
Twitter: @Stiivi

