Hello,

We (Inadco) are looking for users and developers to engage in our open project, code-named, for lack of a better name, "hbase-lattice", in order to mutually benefit and eventually develop a mature HBase-based, real-time, OLAP-ish BI solution.
The basic premise is to use a cuboid-lattice-like model (hence the name) to precompute manually specified cuboids (as opposed to statistically inferred ones), and then to build query optimization on top of those, similarly to how a DBA analyzes query use cases and decides which indexes to build where (a small sketch of this idea follows below). On top of that, we threw in both a declarative API and query language support; our reporting system and the real-time analytics in our platform actually use the query language, to make life easier for our developers there.

We aim to answer aggregate queries over a cube of facts with very short HBase scans over a single projection table, so an analytical reply can be produced in under 1 ms plus network overhead. Another goal is short latency for fact availability, so we employ incremental MapReduce fact compilation (on average ~3-5 minutes from the fact event, depending on the number of cuboids and the cluster size).

At this point we have the project in production with the minimum capabilities we need, still in a kind of pilot mode, but it is to replace our previous manual incremental projection work on a permanent basis very soon (as soon as we finish migrating our reporting). There's a long todo list, and any given third party would likely find gaps it would have liked to see filled. Right now we have integration only with the JasperReports tool, but eventually it shouldn't be too difficult to wrap the client into a JDBC contract as well and enable practically any use; it's just that, since we integrate tightly with both our reporting and RT platforms, we don't have a need for JDBC per se at the moment.

The project readme document is here: https://github.com/dlyubimov/HBase-Lattice/blob/dev-0.1.x/docs/readme.pdf?raw=true or here: https://github.com/inadco/HBase-Lattice/blob/dev-0.1.x/docs/readme.pdf?raw=true

Aside from the capabilities mentioned in that document, we also support custom aggregate functions, which can be plugged directly into the model definition (so, as it stands, we can develop a custom aggregate function set rather quickly and easily). The cube update is incremental, with a two-step, sequentially generated Pig job (so the compiler cycle could be ~5 minutes from the actual event). We process and aggregate our impression, click, and other fact streams with it. The model can be updated under backward compatibility conventions similar to protobuf's (i.e., add stuff, but don't change types), and the changes can be pushed into the production system to take effect immediately, without any need to redeploy code.

Operationally, I tested it on a rather wide event fact stream of 1.2 billion facts/day, packed as protobuf messages inside sequence files, on 6 nodes, and the compilation was not even breaking a sweat. Granted, my data aggregates heavily over the time dimensions corresponding to the time of the event, so the HBase update load is actually pretty light due to the high degree of aggregation. But the biggest benefit is that one can scale the number of facts handled per unit of time horizontally, and pretty impressively.

This is optimized for time series data for the most part, so one will see rather limited and time-oriented support for dimension types and hierarchies at the moment. Generally I think the need is very common for BI solutions over big-data time series such as impression or request logs, but surprisingly I did not find a well-maintained HBase solution for that (although I did see either stale or less capable attempts out there -- I have certainly missed stuff floating around); hence this project.
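To make the cuboid/query-optimization premise concrete, here is a minimal, hypothetical Java sketch. It is not the actual HBase-Lattice API (all names in it are made up for illustration); it only shows the gist of planning an aggregate query against a lattice of manually declared cuboids: pick the cheapest precomputed projection whose dimension set covers the query's group-by set.

    import java.util.*;

    public class CuboidPlanner {

        // A precomputed projection (cuboid), identified by the dimensions it
        // is keyed on, with a row count standing in for scan cost.
        static final class Cuboid {
            final Set<String> dims;
            final long rowCount;
            Cuboid(long rowCount, String... dims) {
                this.dims = new HashSet<>(Arrays.asList(dims));
                this.rowCount = rowCount;
            }
            @Override public String toString() {
                return dims + " (~" + rowCount + " rows)";
            }
        }

        private final List<Cuboid> lattice = new ArrayList<>();

        void add(long rowCount, String... dims) {
            lattice.add(new Cuboid(rowCount, dims));
        }

        // Pick the cheapest cuboid whose dimensions cover the group-by set.
        Optional<Cuboid> plan(Set<String> groupBy) {
            return lattice.stream()
                .filter(c -> c.dims.containsAll(groupBy))
                .min(Comparator.comparingLong((Cuboid c) -> c.rowCount));
        }

        public static void main(String[] args) {
            CuboidPlanner planner = new CuboidPlanner();
            // Cuboids declared by hand in the model, much as a DBA picks indexes.
            planner.add(1_000, "day");
            planner.add(50_000, "day", "campaign");
            planner.add(2_000_000, "day", "campaign", "site");

            // An aggregate query grouping by (day, campaign) resolves to the
            // ~50k-row projection rather than the ~2M-row one.
            System.out.println(planner.plan(Set.of("day", "campaign")).get());
        }
    }

In the real system the cuboids and their metadata of course live in the compiled model described in the readme, not in code; the point here is only the covering-cuboid selection itself, which is what keeps the HBase scans short.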
We are planning to maintain the project for a long time as part of our production system. Please email me if there's interest, as either a user or a collaborator; I think I saw a couple of emails on this list looking for a solution to a similar problem. This is partly inspired by, and intended as a complementary solution to, Tsuna's OpenTSDB (so big thanks to StumbleUpon's people for an example of how to handle time series data).

Thanks.
-Dmitriy
