Hi there-

re: "Based on what I have read it looks like HBase is really good for
scans or row key lookup."

Yes. It is also a good MapReduce source and sink.

re: "how I can do joins"

You either need to denormalize the data on the way in to HBase or do a
lookup.

re: "Will that also be fast?"

HBase has a block cache for frequently accessed data, and very recent
HDFS improvements (e.g., CDH3u3) are a big improvement for random lookups
and HDFS throughput in general. But for your needs this is like asking
"how long is a piece of string?" You are going to need to try some
prototypes and see if it works for your particular situation.

re: "sum, min, max"

As for sum, see MapReduce. For the rest, search the threads on the
dist-list; there has been a conversation on that recently.


On 1/21/12 2:32 AM, "Amit Gupta" <[email protected]> wrote:

>I am not sure how I can do joins using HBase which is essentially what I
>am trying to do. Based on what I have read it looks like HBase is really
>good for scans or row key lookup. Please correct me if I am wrong.
>
>I can have a HBase table for users with {userid + timestamp} as the
>rowkey. Using this lookup for a single user for given time range will be
>fast. However I need to do lookups for millions of users for different
>time range. Will that also be fast ?
>
>Also lookups are not the only thing that I am trying to do. I need to
>compute statistics like sum, min, max etc for each data point for a user.
>How can I do that efficiently using Hbase ?
>
>
>On Fri, Jan 20, 2012 at 2:20 PM, T Vinod Gupta <[email protected]> wrote:
>
>> from the little i have used hbase for, it is really good for the below
>> use case you mentioned. hbase takes care of scale and you can use map
>> reduce to do the kind of task you mentioned below.
>> but please remember that it is super important how you design the
>> schema. the schema should allow for your use case and allow for an
>> efficient map reduce.
>> if you decide with hbase, read the hbase book before deployment or
>> schema design/implementation.
>> thanks
>>
>> On Fri, Jan 20, 2012 at 2:10 PM, Amit Gupta <[email protected]> wrote:
>>
>> > Hi,
>> >
>> > I am trying to figure out if Hbase is the right candidate for my use
>> > case which is as follows :
>> >
>> > I have a users table containing millions users and for each user I
>> > have a bunch of data points for each day in past 2 years. Some of
>> > these data points are number of clicks in different parts of a web
>> > page, total # of clicks, total searches, # of unique searches etc.
>> > So the data is in this form :
>> >
>> > User Id | Date   | X1 (Total Clicks) | X2 (Total Searches) | X3 | ... | Xn
>> > 1       | D1-730 | 4                 | 0.8                 |    |     | 90
>> > 1       | D1-729 | 2                 | 0.5                 |    |     | 50
>> > ...
>> > 1       | D1     | 30                | 0.9                 |    |     | 20
>> > 2       | D1-730 | 23                | 1.2                 |    |     | 85
>> > 2       | D1-729 | 56                | 2.3                 |    |     | 56
>> > ...
>> >
>> > My application has the following predominant query pattern - For a
>> > subset of users (subset being quite large in order of 1 -5 mil), I
>> > want to do sum, min, max, mean, standard deviation of data points for
>> > different date ranges for the users. So for eg user1 may have a start
>> > and end date of {sd1, ed1}, user2 may have {sd2, ed2} and so on. I
>> > want to compute sum, min, max etc for data points X1, X2, ... Xn over
>> > date ranges {sd1, ed1}, {sd2, ed2} , {sd3, ed3} for each user in the
>> > subset .
>> >
>> > Currently we do this in db by creating a table for subset of the users
>> > with their start and end day and joining against the users tables. The
>> > query however is extremely slow and takes hours to execute.
>> >
>> > I am trying to figure out the following :
>> >
>> > 1. Can I do the above query efficiently (I want to reduce the query
>> >    time. Space is not that big of a concern for me) using Hbase ?
>> >
>> > 2. Can someone please give me alternative solutions if Hbase is not
>> >    the right solution for such a use case ?
>> >
>> > Thanks,
>> >
>> > dlg
>> >
>>
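
For anyone reading this thread later, here is a rough sketch of the {userid + timestamp} rowkey idea and the per-user sum/min/max/mean/stddev over a date range. It is plain Python with no HBase client calls: the `SortedTable` class is a hypothetical in-memory stand-in for a table whose rows are kept sorted by key, and `row_key`, `Stats`, and `aggregate` are illustrative names, not part of any HBase API. The day index (0 for D1-730, 730 for D1) is also an assumption made for the example. On a real cluster the same per-range aggregation would more likely run inside a MapReduce job, as suggested above.

```python
import bisect
import math
from dataclasses import dataclass


def row_key(user_id: int, day: int) -> bytes:
    # Fixed-width big-endian fields, so byte order equals (user_id, day) order.
    return user_id.to_bytes(8, "big") + day.to_bytes(4, "big")


@dataclass
class Stats:
    count: int = 0
    total: float = 0.0
    sum_sq: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def add(self, x: float) -> None:
        # Single-pass running aggregates; no need to hold the values in memory.
        self.count += 1
        self.total += x
        self.sum_sq += x * x
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else math.nan

    @property
    def stddev(self) -> float:
        # Population standard deviation derived from the running sums.
        if not self.count:
            return math.nan
        return math.sqrt(max(self.sum_sq / self.count - self.mean ** 2, 0.0))


class SortedTable:
    """Hypothetical stand-in for a table sorted by row key (not an HBase class)."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self._keys = sorted(self._rows)

    def scan(self, start: bytes, stop: bytes):
        # Range scan: start key inclusive, stop key exclusive.
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


def aggregate(table, ranges):
    """ranges maps user_id -> (start_day, end_day), both inclusive."""
    out = {}
    for user_id, (start_day, end_day) in ranges.items():
        stats = Stats()
        start = row_key(user_id, start_day)
        stop = row_key(user_id, end_day + 1)  # exclusive upper bound
        for _, value in table.scan(start, stop):
            stats.add(value)
        out[user_id] = stats
    return out


# X1 (total clicks) values from the table in the original question.
clicks = SortedTable({
    row_key(1, 0): 4, row_key(1, 1): 2, row_key(1, 730): 30,
    row_key(2, 0): 23, row_key(2, 1): 56,
})
for user, s in aggregate(clicks, {1: (0, 1), 2: (0, 730)}).items():
    print(user, s.total, s.minimum, s.maximum, s.mean, s.stddev)
```

Because the user id leads the key, each user's days sit next to each other, so one user's date range is a single contiguous scan. Whether millions of such scans are fast enough is exactly the thing to prototype.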
