Have you looked at the phoenix project? http://phoenix.apache.org/ It is an SQL layer on top of hbase, an alternative to map-reduce jobs when ad hoc/realtime queries are needed. Your use case seems like it would work fairly well with phoenix.
Alok On Mon, Jan 12, 2015 at 7:06 PM, Wilm Schumacher <[email protected]> wrote: > Hi, > > I'm doing something comparable right now, but not with such a HUGE > database O_o. 10 Mio results for such a query? This would mean that you > have 100 Mio -> 1 Billion customers ?!?! > > However: in my opinion with such a huge database HBase is a good fit. > However, your data model should be changed for that purpose. > > You could use something like > <userid>-<item_id> (or the other way around) > for a row key and > data:clicktime => <timestamp> > or whatever for data .... > > for the "item table". In this example the "e-mail" table does not have > to be touched (I assume they are separated) > > 1.) Map reduce > > Your problem make a good point for map reduce. Just map over the e-mail > events, when you find an e-mail that fits your question, make a get on > the user which is connected to the e-mail and the item_id (row key > design above). If so => emit and use an hbase table as sink. By that you > would make sure that the user is only processed once (as the e-mail > could be opened twice or so) and you would have a nice list in a > separate table which you could use. > > For such a huge database the question of 30 min or less then is "just" a > question of money for hbase/yarn nodes. > > 2.) Simple index > > I thought of simple indexes, too. However, this only would work properly > if you would make a table where the item_id would be the primary key for > the index and the users browse events would make up the column family. > > with <ts> the specific timestamp of the action something like > item_1 user:alice <ts> user:bob <ts> > item_2 user:alice <ts> user:charles <ts> > > (hope the e-mail is rendered okay to explain what I mean) > > Then you could simply fetch the <item_id> row and use that to check if > the e-mail action exist. > > However, this index would be a table of <number of items> rows and > <number of users> qualifiers in one column family. Of course it would be > sparse, but would make up by above estimate 100 Mio -> 1 Billion > qualifiers in the main column family. I never ran a table THAT wide. > Does anyone on this list did something like that and can comment? > > Furthermore you would have to iterate over the result ... which has to > be estimated >> 10 Mio items, which sounds like a pretty bad idea. > > 3.) Fancy index > > You make up an index for exactly your question from above. > > Row keys would be > <e-mail-action>-<e-mail_id>-<item_id>-<userid> > where <e-mail-action> is "click", "erase", .... > > with data > email:1 => <rowkey of the email event> > email:2 => <rowkey of the email event> > ... > (as some events could occure more than once) > > and > item:1 => <rowkey of the item browse event> > item:2 => <rowkey of the item browse event> > ... > > Then you would have your list right away e.g. by setting up a scan for > the specific range (which is the answer to your specific question in > this case). > > But you would have to make on every e-mail event a scan over the list of > items the user has browsed (by design above a simple scan in the "item > table" for the specific user) and fill the index. > > Which actually sounds like an even worse idea. > > ==== > > But as I read your e-mail, this example is just one of the query of many > different queries, right? And for every query you need one such index. > > Normally I'm in favour of indexes, but on this case I think you really > need a clever row key design of your separate tables and the full power > of map reduce. > > However, I saw your analogue postings on the other "nosql" mailing > lists. As cassandra have a "join" feature in the cql this would be a > good fit ... at first sight. Furthermore the querying for ElasticSearch > can easily be used in a way which satisfy your need. However, by design > (as I understand what c* is doing on joins) cassandra would do A LOT of > communication for that case (i.e. inefficient), and elastic search will > probably break down if the master would have to give you 10 Mio results > back. Thus you would have to iterate A LOT of pages to get your full > result, which means a lot of JSON parsing and http stuff. > > Even if hbase is not the best choice on the first sight, I think it > could be the best choice nevertheless. > > Best wishes, > > Wilm > > ps: man ... that's a lot of data. Is that you, amazon? ;) > > Am 13.01.2015 um 01:42 schrieb Chen Wang: >> Hey Guys, >> I am seeking advice on design a system that maintains a historical view of >> a user's activities in past one year. Each user can have different >> activities: email_open, email_click, item_view, add_to_cart, purchase etc. >> The query I would like to do is, for example, >> >> Find all customers who browse item A in the past 6 month, and also clicked >> an email. >> and I would like the query to be done in reasonable time frame. (for >> example, within 30 minutes to retrieve 10million such users) >> >> Since we already have HBase cluster in place, HBase becomes my first >> choice. So I can have customer_id as the row key, column family be >> 'Activity', then have certain attributes associated with the column >> family,something like: >> >> custer_id, browse:{item_id:12334, timestamp:epoc} >> >> However, It seems that HBase would not be a good choice for supporting the >> queries above. Even its possible with scan, it will be super inefficient >> due to the size of the data set. >> >> Is my understanding correct and I should resort to other data store.(ES in >> my opinion). or has anyone done similar thing with HBase? >> >> Thanks in advance. >> Chen >> >
