Hi everybody,

we currently have our Hadoop + HBase cluster running with 6 servers,

and everything is working just fine. We have a web application where data is stored with 
row key = user id (a meaningless UUID). Each user has a cookie containing this row key; 
behind the key are column families holding items, e.g. the family "impressions", where 
every impression is stored with its timestamp etc.

the row key is defined as the user id to make real-time requests possible, 
so we can retrieve all of a user's data very fast.

now we are running MapReduce jobs to generate reports: for example, we want to 
know how many impressions were made by all users in the last x days. The scan of 
the MR job therefore runs over all data in our HBase table for the particular 
family. This currently takes about 70 seconds, which is a bit too long, and as 
the data grows the time will increase unless we add new workers to the cluster. 
We have 22 regions right now.
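Since the report only needs the last x days, one thing worth checking is whether the scan can at least be narrowed by cell timestamp and tuned for sequential reads, instead of touching every cell. Below is a minimal sketch: the time-window arithmetic is plain Java, and the HBase-client calls (which need a running cluster, so they are shown as comments) use the standard Scan API; the table and mapper names are hypothetical, not from this thread.

```java
// Sketch: compute the time window for "impressions in the last x days",
// and (in comments) how it would be applied to an HBase MapReduce scan.
public class ScanWindow {
    /** Returns {start, end} in epoch millis for the last `days` days. */
    static long[] window(long nowMillis, int days) {
        long start = nowMillis - days * 86_400_000L; // 86,400,000 ms per day
        return new long[] { start, nowMillis };
    }

    public static void main(String[] args) {
        long[] w = window(System.currentTimeMillis(), 7);
        System.out.println("start=" + w[0] + " end=" + w[1]);
        // With the HBase Java client (assumed setup, needs a cluster):
        //   Scan scan = new Scan();
        //   scan.addFamily(Bytes.toBytes("impressions")); // only the family we report on
        //   scan.setTimeRange(w[0], w[1]);                // skip cells outside the window
        //   scan.setCaching(500);                         // fewer RPC round trips per mapper
        //   scan.setCacheBlocks(false);                   // don't evict hot data from block cache
        //   TableMapReduceUtil.initTableMapperJob("users", scan,
        //       ImpressionMapper.class, Text.class, LongWritable.class, job);
    }
}
```

setTimeRange filters on cell timestamps, so it does not avoid reading every row the way a row-key range would, but together with the caching settings it can shave noticeable time off a full-table MR scan.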

the problem I see is that we cannot define a useful filter for the scan: the 
row key (user id) is just a UUID, with nothing meaningful in it.

what can we do to accelerate the scan? Would it be advisable to store the data 
redundantly? For example, we could create a second table and store every 
impression twice: once with the user id as row key in the first table, and once 
with a timestamp as row key in the second table.
The data volume would grow twice as fast, but our scans on the second table 
would be many times faster than they are now.
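If you go that way, the second table's key probably needs more than a bare timestamp, since two impressions in the same millisecond would collide. A sketch of one possible key layout (names are illustrative, not from this thread): timestamp first, so a "last x days" report becomes a bounded range scan, with the user UUID appended for uniqueness. One known trade-off to weigh: a purely time-ordered key sends all current writes to a single region.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.UUID;

// Sketch of a row key for the proposed second table:
// [8-byte big-endian timestamp][16-byte user UUID].
// Big-endian positive longs sort correctly under HBase's unsigned
// lexicographic byte ordering, so rows are stored in time order.
public class TimeRowKey {
    static byte[] key(long timestampMillis, UUID userId) {
        return ByteBuffer.allocate(8 + 16)
                .putLong(timestampMillis)                 // time prefix drives sort order
                .putLong(userId.getMostSignificantBits()) // UUID suffix avoids collisions
                .putLong(userId.getLeastSignificantBits())
                .array();
    }

    // Start/stop keys for a scan over [from, to): only rows inside the
    // window are read, instead of the whole table.
    static byte[] startKey(long from) { return ByteBuffer.allocate(8).putLong(from).array(); }
    static byte[] stopKey(long to)    { return ByteBuffer.allocate(8).putLong(to).array(); }

    public static void main(String[] args) {
        byte[] a = key(1_000L, UUID.randomUUID());
        byte[] b = key(2_000L, UUID.randomUUID());
        // Unsigned lexicographic order (what HBase uses) follows time order here.
        System.out.println(Arrays.compareUnsigned(a, b) < 0); // true
    }
}
```

With such a key the report scan would use `new Scan(startKey(from), stopKey(to))` (assumed usage) and read only x days of rows, which is where the "x times faster" you describe would come from.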

comments are very much appreciated.

andre
