If you need to compute aggregates over substantial volumes of data, it will always take time. Sequential row scans are typically IO bound, so faster IO would probably buy you some speed, but on AWS I guess that's difficult to come by.
If you really want to speed things up, you need to find a way to reduce the data volume without losing information. This usually requires preprocessing of some kind. You could store a pre-calculated average or count+sum pair per row in a separate column family or table; that way, you'd only have to scan the aggregates instead of all the raw data. Getting sub-second answers to questions that require a lot of source data usually means storing derived data sets redundantly.

If you never or rarely have missing data points, you could also store a whole series of data points in one value: let the column qualifier be just a year or month, and store a list of 366 data points in the value. This would allow you to go through it much more quickly. You'd just need a magic value for missing data points, so that you can take care of those in software. We do something similar using protobuf as the storage format.

The point is that calculating averages is not computationally heavy, and you're scanning through data sequentially, so the main bottleneck will be the amount of data that needs to come off disk: there are not a lot of seeks happening and the CPU is not working very hard.

If the usage pattern means the same data points are queried often (i.e. there is a hot spot in the data), you could make things snappier by throwing more block cache at it.

Hope this helps,

Friso

On 28 dec 2010, at 20:11, Gangl, Michael E (388K) wrote:

> I'm trying to do some realtime queries in HBase over about 550 million
> points of data (~150K rows, ~3700 columns/row), and need some help/ideas on
> speeding it up.
>
> I have a dataset in HBase that consists of the following:
>
> Row: a geohashed lat/lon value (there are 150k unique row keys)
>
> Column family 1: science data. There are a bunch of columns in this family,
> notably a year/day-of-year (doy) pair from 1999260 to 2009366 (so about 3700
> columns per family).
>
> It looks like this:
>
> Key          |1999266|1999267|1999268|  ...  |2009365 |2009366 |
> -------------|-------|-------|-------|-------|--------|--------|
> ljkashd781aj |  -14  |  -16  |  -17  | ..... | -17.755| -17.5  |
> ljkavhd79ads |  -14  |  -17  |  -17  | ..... | -16.785| -16.665|
> ljkcshd84q2d |  -15  |  -16  | -17.35| ..... | -15.445| -12.34 |
>
> So if I wanted to look up the all-time data for a single point (let's say
> lat/lon 75,-135), I'd geohash that point and do a get on it (we're assuming
> the point exists).
>
> I'd get back the science column family, which has ~3700 points (10 years *
> 365 days), and can plot this or do whatever I need just fine.
>
> The issue happens when I want to get the data over a larger region. I can
> select the region simply enough: I do a scan from the start point to
> the end point (a bounding box of lat/lon values, where geohashes inside
> that box are in lexicographic order; this is all straightforward for me).
> The issue I have is aggregating the results and returning the average result
> for a year/doy pair.
>
> Currently I set up a scanner to go through each row, and for each key I
> parse the yeardoy/value and add this to a hash map. The yeardoy is a key
> into it, and then I have an object with sum and count fields (the count is
> incremented each time I add to the sum, no more than +1 per row). This lets
> me sum up the values and find the average by dividing the sum by the count,
> giving the average value for a certain year/doy. But this takes a long time
> for larger regions (>1 minute, often 2-3 minutes), and I'm trying to figure
> out the best way to do this.
>
> I currently have a 10 node HBase setup (c1.xlarge on AWS), with LZO
> compression, and I cache the scan results to a certain extent (I've played
> with this but don't think it's speeding me up or slowing me down much). I'm
> wondering if I should add more machines, or if there is a better way to get
> the results I'm looking for.
>
> I'm trying to do the aggregation in realtime, but understand I can submit
> jobs and return the results to a user at a later time if needs be.
>
> Sorry for the long email, just want to explain the problem clearly.
>
> -Mike
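
PS: to follow up on the count+sum suggestion above — the reason to store a count+sum pair rather than a plain average is that pairs from different rows or regions can be merged, and the average falls out at the end; averages alone can't be combined without the counts. A minimal sketch in Java (class and field names are made up for illustration):

```java
// A pre-aggregated (count, sum) pair, as stored per row. Pairs merge
// cleanly, so a region-wide average is just the merge of per-row pairs.
public class CountSum {
    long count;
    double sum;

    public CountSum(long count, double sum) {
        this.count = count;
        this.sum = sum;
    }

    // Fold another pre-aggregated pair into this one.
    public void merge(CountSum other) {
        count += other.count;
        sum += other.sum;
    }

    public double average() {
        return count == 0 ? Double.NaN : sum / count;
    }
}
```

In the scan loop from the mail above, you'd keep a HashMap keyed by yeardoy and merge each row's pre-computed pair into it, instead of touching every raw data point.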
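And a rough sketch of the one-value-per-year idea: pack 366 daily values into a single byte[] with a magic value for the missing days. We use protobuf for this in practice; the sketch below uses a plain ByteBuffer instead, and all names (including the magic value) are made up for illustration:

```java
import java.nio.ByteBuffer;

// Packs a whole year of daily readings into one cell value, so a scan
// reads one value per row per year instead of ~366 separate columns.
public class PackedYear {
    // Magic value marking a missing day; pick something outside the
    // valid range of your measurements.
    public static final double MISSING = -9999.0;
    public static final int DAYS = 366;

    public static byte[] encode(double[] days) {
        ByteBuffer buf = ByteBuffer.allocate(DAYS * 8);
        for (int i = 0; i < DAYS; i++) {
            buf.putDouble(i < days.length ? days[i] : MISSING);
        }
        return buf.array();
    }

    public static double[] decode(byte[] value) {
        ByteBuffer buf = ByteBuffer.wrap(value);
        double[] days = new double[DAYS];
        for (int i = 0; i < DAYS; i++) {
            days[i] = buf.getDouble();
        }
        return days;
    }

    // Average over one packed year, skipping missing days in software.
    public static double average(byte[] value) {
        double sum = 0;
        int count = 0;
        for (double d : decode(value)) {
            if (d != MISSING) {
                sum += d;
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }
}
```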
