I¹m trying to do some realtime queries in HBASE over about 550 million points of data (~150K rows, ~3700 columns/row), and need some help/ideas on speeding it up.
I have a dataset in HBASE that consist of the following: Row a geohashed lat/lon value (there are 150k unique row keys) Column family 1: science data There are a bunch of columns in this family, notable a year/day-of-year (doy) pair from 1999260 to 2009366 (so about 3700 columns per family) It looks like this: Key |1999266|1999267|1999268| ... |2009365 | 2009366 -------------|-------|-------|-------|-------|--------|----------| ljkashd781aj | -14 | -16 | -17 | ..... | -17.755| -17.5 | ljkavhd79ads | -14 | -17 | -17 | ..... | -16.785| -16.665 | ljkcshd84q2d | -15 | -16 | -17.35| ..... | -15.445| -12.34 | So if I wanted to look for the all-time data for a single point (let¹s say lat/lon 75,-135) I¹d geohash that point and do a get on it (we¹re assuming the point exists). I¹d get back the science column family which has ~3700 points (10 years * 365 days), and can plot this or do whatever I need just fine. The issue happens when I want to get the data over a larger region- I can select the region simply enough, I simply do a scan from the start point to then end point (a bounding box of lat/long values, where geohashes inside that box are in lexicographic order, this is all straight forward for me). The issue I have is aggregating the results and returning the average result for a year/doy pair. Currently I set up a scanner to go through each row, and for each key I parse the yeardoy/value and add this to a hash map- the yeardoy is a key into it, and then I have an object with sum and count (incremented each time I add to the sum, no more than +1 per row) fields. This lets me sum up the values and find the average by dividing the sum by the count and having the average value for a certain year/doy. But this takes a long time for larger regions (>1 minute, often times 2-3 minutes), and I'm trying to figure out the best way to do this- I currently have a 10 node hbase setup (c1.xlarge on AWS). LZO compression and caching the scan results to a certain extent (I've played with this but don't think it's speeding me up or slowing me down too much) I'm wondering if I should add more machines, or if there is a better way to get the results I'm looking for. I'm trying to do the aggregation in realtime, but understand I can submit jobs and return the results to a user at a later time if needs be. Sorry for the long email, just want to explain the problem clearly. -Mike
