If you need to compute aggregates over substantial volumes of data, it will 
always take time. Sequential row scans are typically I/O bound, so faster disks 
could buy you some speed, but on AWS that's hard to arrange.

If you really want to speed things up, you need to find a way to reduce the 
data volume without losing information. This usually requires some kind of 
preprocessing. You could store a pre-calculated average, or a count+sum pair, 
per row in a separate column family or table. That way you'd only have to scan 
the aggregates instead of all the raw data. Getting sub-second answers to 
questions that touch a lot of source data usually means storing derived data 
sets redundantly.
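The count+sum idea can be sketched like this (illustrative Python, not an 
HBase API; the function names and dict layout are made up for the example). 
The reason to store count+sum rather than the average itself is that two 
partial count+sum aggregates merge losslessly, while two averages do not:

```python
# Pre-aggregate per year/doy as {yeardoy: (count, sum)} instead of raw values.

def merge(agg_a, agg_b):
    """Combine two {yeardoy: (count, sum)} partial aggregates losslessly."""
    out = dict(agg_a)
    for key, (cnt, tot) in agg_b.items():
        c0, t0 = out.get(key, (0, 0.0))
        out[key] = (c0 + cnt, t0 + tot)
    return out

def averages(agg):
    """Derive {yeardoy: average} from a {yeardoy: (count, sum)} aggregate."""
    return {key: tot / cnt for key, (cnt, tot) in agg.items() if cnt}
```

At query time you'd merge the pre-stored pairs for the rows in range and only 
divide at the very end.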

If you never or rarely have missing data points, you could also store a whole 
series of data points in one value. Let the column qualifier be just a year or 
month and store a list of 366 data points in the value; that lets you go 
through it much more quickly. You'd just need a magic value for missing data 
points, so you can handle those in software. We do something similar, using 
protobuf as the storage format.
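A minimal sketch of the packing scheme, using Python's stdlib struct module 
instead of protobuf and NaN as the magic value for missing days (both choices 
are just for illustration, not what we actually use):

```python
# Pack one year of daily readings into a single cell value, so a scan reads
# one value per row per year instead of ~366 separate columns.
import math
import struct

MISSING = float("nan")  # sentinel for days with no reading
DAYS = 366

def pack_year(points):
    """points: {doy: value}, doy in 1..366 -> bytes of 366 little-endian doubles."""
    row = [points.get(doy, MISSING) for doy in range(1, DAYS + 1)]
    return struct.pack(f"<{DAYS}d", *row)

def unpack_year(blob):
    """bytes -> {doy: value}, dropping the sentinel (missing) days."""
    row = struct.unpack(f"<{DAYS}d", blob)
    return {doy: v for doy, v in enumerate(row, start=1) if not math.isnan(v)}
```

Each year's value is then a fixed 366 * 8 bytes, and the per-column qualifier 
overhead of ~366 separate cells disappears.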

The point is that calculating averages is not computationally heavy, and you're 
scanning through the data sequentially, so the main bottleneck will be the 
amount of data that needs to come off disk: there are not a lot of seeks 
happening and the CPU is not working very hard.

If the usage pattern means the same data points get queried often (i.e. there 
is a hot spot in the data), you could make things snappier by throwing more 
block cache at it.
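For example, the block cache's share of the region server heap is controlled 
by `hfile.block.cache.size` in hbase-site.xml (a fraction of heap; 0.4 below 
is just an illustrative value, so check your version's default and leave 
enough headroom for the memstores):

```xml
<!-- hbase-site.xml: give the block cache a larger slice of the heap. -->
<property>
  <name>hfile.block.cache.size</name>
  <value>0.4</value>
</property>
```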



Hope this helps,
Friso




On 28 dec 2010, at 20:11, Gangl, Michael E (388K) wrote:

> I'm trying to do some realtime queries in HBase over about 550 million
> points of data (~150K rows, ~3700 columns/row), and need some help/ideas on
> speeding it up.
> 
> I have a dataset in HBase that consists of the following:
> 
> Row: a geohashed lat/lon value (there are 150K unique row keys)
> 
> Column family 1: science data. There are a bunch of columns in this family,
> notably a year/day-of-year (doy) pair from 1999260 to 2009366 (so about 3700
> columns per family)
> 
> It looks like this:
> 
> Key          |1999266|1999267|1999268|  ...  |2009365 | 2009366
> -------------|-------|-------|-------|-------|--------|----------|
> ljkashd781aj |  -14  | -16   | -17   | ..... | -17.755| -17.5    |
> ljkavhd79ads |  -14  | -17   | -17   | ..... | -16.785| -16.665  |
> ljkcshd84q2d |  -15  | -16   | -17.35| ..... | -15.445| -12.34   |
> 
> 
> So if I wanted to look at the all-time data for a single point (let's say
> lat/lon 75,-135), I'd geohash that point and do a get on it (we're assuming
> the point exists).
> 
> I'd get back the science column family, which has ~3700 points (10 years *
> 365 days), and can plot this or do whatever I need just fine.
> 
> The issue happens when I want to get the data over a larger region. I can
> select the region simply enough: I do a scan from the start point to the
> end point (a bounding box of lat/lon values, where the geohashes inside
> that box are in lexicographic order; this is all straightforward for me).
> The issue I have is aggregating the results and returning the average result
> for a year/doy pair.
> 
> Currently I set up a scanner to go through each row, and for each key I
> parse the yeardoy/value and add it to a hash map: the yeardoy is the key,
> and the value is an object with sum and count fields (the count is
> incremented each time I add to the sum, no more than +1 per row). This lets
> me sum up the values and find the average for a certain year/doy by dividing
> the sum by the count. But this takes a long time for larger regions
> (>1 minute, often 2-3 minutes), and I'm trying to figure out the best way
> to do this.
> 
> I currently have a 10-node HBase setup (c1.xlarge on AWS) with LZO
> compression, and I cache the scan results to a certain extent (I've played
> with this but don't think it's speeding me up or slowing me down much). I'm
> wondering if I should add more machines, or if there is a better way to get
> the results I'm looking for.
> 
> I'm trying to do the aggregation in realtime, but understand I can submit
> jobs and return the results to a user at a later time if need be.
> 
> Sorry for the long email, just want to explain the problem clearly.
> 
> -Mike
> 
