With an entity-centric data model (i.e., customer_id as the row key), you're looking at a full table scan for every query. A 30-minute SLA puts you well within the realm of a MapReduce/Cascading/Pig/Hive/Tez/Spark job. HBase can work fine for this, but since you're not really in the low-latency world, you might consider a more analytical storage system (e.g., HDFS + ORC/Parquet). Of course, if your data is extremely sparse, you'll land back here at HBase.
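For illustration, a minimal Java sketch of the per-customer predicate such a full-scan batch job would have to evaluate for every row (the Activity record, field names, and six-month window are assumptions here, not your actual schema):

```java
import java.util.List;

// Hypothetical per-customer check a full-scan batch job (MapReduce/Spark/etc.)
// would run for: "browsed item A in the past 6 months AND clicked an email".
public class ActivityPredicate {
    static final long SIX_MONTHS_MS = 182L * 24 * 60 * 60 * 1000;

    // One activity event: type (e.g. "item_view", "email_click"), item id, epoch millis.
    record Activity(String type, String itemId, long timestamp) {}

    static boolean matches(List<Activity> activities, String targetItem, long now) {
        boolean browsed = false, clicked = false;
        for (Activity a : activities) {
            if (now - a.timestamp() > SIX_MONTHS_MS) continue; // outside the window
            if (a.type().equals("item_view") && a.itemId().equals(targetItem)) browsed = true;
            if (a.type().equals("email_click")) clicked = true;
        }
        return browsed && clicked;
    }
}
```

With customer_id as the row key, this predicate runs against every customer in the table, which is why the batch frameworks above are the right fit for a 30-minute SLA.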
You can achieve lower latencies with HBase by pushing query components into the row key. However, if the queries are truly ad hoc, you'll probably want secondary indices; Apache Phoenix is a great choice if you decide to pursue that route. ES may also be a reasonable choice here, but it depends on many other factors, including scale and your philosophy about indices as a data storage medium. If time is a frequent component of your query patterns, I recommend you model it directly in your schema. You'll have more flexibility and better performance than if you rely on HBase's timestamp for this attribute.

-n

On Mon, Jan 12, 2015 at 4:42 PM, Chen Wang <[email protected]> wrote:

> Hey Guys,
> I am seeking advice on designing a system that maintains a historical view
> of a user's activities over the past year. Each user can have different
> activities: email_open, email_click, item_view, add_to_cart, purchase, etc.
> The query I would like to do is, for example:
>
> Find all customers who browsed item A in the past 6 months and also
> clicked an email.
>
> And I would like the query to be done in a reasonable time frame (for
> example, within 30 minutes to retrieve 10 million such users).
>
> Since we already have an HBase cluster in place, HBase became my first
> choice. I can have customer_id as the row key, with a column family
> 'Activity', then have certain attributes associated with the column
> family, something like:
>
> customer_id, browse:{item_id:12334, timestamp:epoch}
>
> However, it seems that HBase would not be a good choice for supporting the
> queries above. Even if it's possible with a scan, it will be super
> inefficient due to the size of the data set.
>
> Is my understanding correct, and should I resort to another data store (ES,
> in my opinion)? Or has anyone done a similar thing with HBase?
>
> Thanks in advance,
> Chen
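To make the row-key suggestion above concrete, here's a minimal Java sketch of a query-centric composite key. The field layout, fixed widths, and reversed-timestamp trick are illustrative assumptions, not a prescribed schema:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch of a query-centric composite row key: activity type + item id +
// reversed timestamp. Keys for one (type, item) pair are contiguous and sort
// newest-first, so a time-bounded query becomes a bounded row-range scan
// instead of a full table scan.
public class RowKeys {
    static byte[] key(String activityType, String itemId, long epochMillis) {
        byte[] type = fixed(activityType, 12);
        byte[] item = fixed(itemId, 12);
        long reversed = Long.MAX_VALUE - epochMillis; // newer events sort first
        return ByteBuffer.allocate(12 + 12 + 8)
                .put(type).put(item).putLong(reversed).array();
    }

    // Pad (or truncate) to a fixed width so keys compare field by field.
    static byte[] fixed(String s, int width) {
        byte[] out = new byte[width];
        byte[] src = s.getBytes(StandardCharsets.UTF_8);
        System.arraycopy(src, 0, out, 0, Math.min(src.length, width));
        return out;
    }
}
```

With this layout, "item_view of item A in the past 6 months" is a Scan between key("item_view", "A", now) and key("item_view", "A", now - sixMonths), with customer_id stored as a column (or key suffix) rather than the key prefix.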
