Re: aggregation performance

James Taylor Thu, 03 May 2012 10:02:44 -0700

We've seen reasonable performance, with the caveat that you need to parallelize the scan doing the aggregation. In our benchmarking, we have the client scan each region in parallel and have a coprocessor aggregate the row count and return a single row back (with the client then totaling the counts it gets back). Here are the numbers we've seen when aggregating 1 million rows, using a slightly older HBase version (~0.92):


Schema: 50 columns x 50 bytes, with compressible data
Regions     RowCount         RowCount with single binary filter
            Time (sec)       Time (sec)
  1          11.3             19.0
  4           3.5              5.6
 16           1.8              2.6
 32           1.2              1.8


Schema: 1 column x 2500 bytes, with compressible data
Regions     RowCount         RowCount with single binary filter
            Time (sec)       Time (sec)
  1           7.0              7.0
  4           1.2              1.2
 16           0.7              0.7
 32           0.3              0.3

This was run on a four-machine cluster, with each machine having a 4 GB heap and with the servers warmed up (cached data).
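
For anyone who wants to see the shape of the fan-out described above, here is a minimal sketch in plain Java. It deliberately makes no HBase calls: each "region" is just an in-memory array, and countRegion is a hypothetical stand-in for the coprocessor step that returns one count per region. The class and method names are illustrative assumptions, not code from our benchmark.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRegionCount {

    /**
     * Hypothetical stand-in for the server-side step: walk the rows of one
     * region and return a single partial count.
     */
    static long countRegion(long[] regionRows) {
        long count = 0;
        for (long ignored : regionRows) {
            count++;
        }
        return count;
    }

    /**
     * Client-side step: submit one task per region, then total the partial
     * counts that come back.
     */
    static long countAll(List<long[]> regions)
            throws InterruptedException, ExecutionException {
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, regions.size()));
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (long[] region : regions) {
                Callable<Long> task = () -> countRegion(region);
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // blocks until that region is done
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Four simulated regions of 250,000 rows each (1 million rows total).
        List<long[]> regions = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            regions.add(new long[250_000]);
        }
        System.out.println(countAll(regions));
    }
}
```

The point of the structure is that only one number per region crosses back to the client, so the client-side work stays small however many rows each region holds.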


Hope this helps.

    James


On 05/03/2012 08:01 AM, Tom Brown wrote:

For our solution we are doing some aggregation on the server via
coprocessors. In general, for each row there are 8 columns: 7 columns
that contain numbers (for summation) and 1 column that contains a
HyperLogLog counter (about 700 bytes). Functionally, this solution
works well and ought to scale with the number of region servers.
However, the individual request performance leaves a little to be
desired. What we've seen is that to scan 40000 rows (aggregated into
3000 rows) takes about 4 seconds.
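
To make the per-row work concrete, here is a rough sketch of that kind of grouping step in plain Java, not our actual coprocessor code. It assumes each input row carries a group key, 7 long values, and a HyperLogLog register array stored one byte per register; the register layout, the class name RowAggregator, and its methods are all hypothetical. The numeric columns are summed, and the counters are merged with the standard HyperLogLog union (register-wise maximum).

```java
import java.util.HashMap;
import java.util.Map;

public class RowAggregator {
    static final int NUMERIC_COLUMNS = 7;

    /** Running aggregate for one output row. */
    static final class Aggregate {
        final long[] sums = new long[NUMERIC_COLUMNS];
        final byte[] registers;

        Aggregate(int registerCount) {
            this.registers = new byte[registerCount];
        }

        void add(long[] values, byte[] rowRegisters) {
            // Sum the 7 numeric columns.
            for (int i = 0; i < NUMERIC_COLUMNS; i++) {
                sums[i] += values[i];
            }
            // HyperLogLog union: keep the larger value of each register.
            for (int i = 0; i < registers.length; i++) {
                if (rowRegisters[i] > registers[i]) {
                    registers[i] = rowRegisters[i];
                }
            }
        }
    }

    private final Map<String, Aggregate> groups = new HashMap<>();
    private final int registerCount;

    RowAggregator(int registerCount) {
        this.registerCount = registerCount;
    }

    /** Fold one scanned row into the aggregate for its group key. */
    void accept(String groupKey, long[] values, byte[] rowRegisters) {
        groups.computeIfAbsent(groupKey, k -> new Aggregate(registerCount))
              .add(values, rowRegisters);
    }

    Aggregate get(String groupKey) {
        return groups.get(groupKey);
    }

    int groupCount() {
        return groups.size();
    }

    public static void main(String[] args) {
        RowAggregator agg = new RowAggregator(4);
        agg.accept("a", new long[] {1, 2, 3, 4, 5, 6, 7}, new byte[] {1, 0, 3, 0});
        agg.accept("a", new long[] {1, 1, 1, 1, 1, 1, 1}, new byte[] {0, 2, 1, 0});
        System.out.println(agg.groupCount() + " group(s), first sum = "
                + agg.get("a").sums[0]);
    }
}
```

Both the sums and the register-wise maximum are associative, so partial aggregates built on different region servers can be merged on the client in the same way.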

Our code is in its early stages (unoptimized), so we hope to see some
significant performance improvements when we run our coprocessor under
a profiler. Our benchmarks were also run on underpowered machines (only
2 GB of RAM).

Hope this helps!

--Tom

On Thu, May 3, 2012 at 6:08 AM, Pere Ferrera wrote:

Hi,

Is anybody benchmarking the performance of server-side aggregations through
co-processors in HBase? I am interested to know if HBase could potentially
be used to calculate real-time SQL-like aggregations at a good level of
performance (q < 200 ms in a high-load, big-dataset scenario). Just curious to
know before I implement my own benchmarks.

Pere.
