Mike,
Sounds like you have two problems. One getting the data. The second is getting the data combined by geographical point. The first part of the problem seems to suggest that you won't get a lot of parallelism unless you're working with a very large square area. The second part of the problem deals with taking all of the data and processing it. Its the second part of the problem that may make it more of an interesting m/r problem. -Mike > From: [email protected] > To: [email protected] > Date: Tue, 28 Dec 2010 11:11:50 -0800 > Subject: Realtime Query with hbase (aggregation) > > I¹m trying to do some realtime queries in HBASE over about 550 million > points of data (~150K rows, ~3700 columns/row), and need some help/ideas on > speeding it up. > > I have a dataset in HBASE that consist of the following: > > Row a geohashed lat/lon value (there are 150k unique row keys) > > Column family 1: science data There are a bunch of columns in this family, > notable a year/day-of-year (doy) pair from 1999260 to 2009366 (so about 3700 > columns per family) > > It looks like this: > > Key |1999266|1999267|1999268| ... |2009365 | 2009366 > -------------|-------|-------|-------|-------|--------|----------| > ljkashd781aj | -14 | -16 | -17 | ..... | -17.755| -17.5 | > ljkavhd79ads | -14 | -17 | -17 | ..... | -16.785| -16.665 | > ljkcshd84q2d | -15 | -16 | -17.35| ..... | -15.445| -12.34 | > > > So if I wanted to look for the all-time data for a single point (let¹s say > lat/lon 75,-135) I¹d geohash that point and do a get on it (we¹re assuming > the point exists). > > I¹d get back the science column family which has ~3700 points (10 years * > 365 days), and can plot this or do whatever I need just fine. > > The issue happens when I want to get the data over a larger region- I can > select the region simply enough, I simply do a scan from the start point to > then end point (a bounding box of lat/long values, where geohashes inside > that box are in lexicographic order, this is all straight forward for me). > The issue I have is aggregating the results and returning the average result > for a year/doy pair. > > Currently I set up a scanner to go through each row, and for each key I > parse the yeardoy/value and add this to a hash map- the yeardoy is a key > into it, and then I have an object with sum and count (incremented each time > I add to the sum, no more than +1 per row) fields. This lets me sum up the > values and find the average by dividing the sum by the count and having the > average value for a certain year/doy. But this takes a long time for larger > regions (>1 minute, often times 2-3 minutes), and I'm trying to figure out > the best way to do this- > > I currently have a 10 node hbase setup (c1.xlarge on AWS). LZO compression > and caching the scan results to a certain extent (I've played with this but > don't think it's speeding me up or slowing me down too much) I'm wondering > if I should add more machines, or if there is a better way to get the > results I'm looking for. > > I'm trying to do the aggregation in realtime, but understand I can submit > jobs and return the results to a user at a later time if needs be. > > Sorry for the long email, just want to explain the problem clearly. > > -Mike >
