I ran some experiments comparing the scan, coprocessor, and MapReduce approaches in an EC2 environment. You may find them interesting: http://hbase-coprocessor-experiments.blogspot.com/2011/05/extending.html
Thanks,
Himanshu

On Thu, May 3, 2012 at 11:02 AM, James Taylor <[email protected]> wrote:
> We've seen reasonable performance, with the caveat that you need to
> parallelize the scan doing the aggregation. In our benchmarking, we have the
> client scan each region in parallel and have a coprocessor aggregate the row
> count and return a single row back (with the client then totaling the counts
> it gets back). Here are the numbers we've seen when aggregating 1 million
> rows, this with a slightly older HBase version (~0.92):
>
> Schema: 50col x 50bytes with compressible data
> Regions   RowCount      RowCount with single binary filter
>           Time (sec)    Time (sec)
> 1         11.3          19.0
> 4          3.5           5.6
> 16         1.8           2.6
> 32         1.2           1.8
>
> Schema: 1col x 2500bytes with compressible data
> Regions   RowCount      RowCount with single binary filter
>           Time (sec)    Time (sec)
> 1          7.0           7.0
> 4          1.2           1.2
> 16         0.7           0.7
> 32         0.3           0.3
>
> This is run on a four machine cluster with each machine having 4G Heap and
> with the servers warmed-up (cached data).
>
> Hope this helps.
>
> James
>
> On 05/03/2012 08:01 AM, Tom Brown wrote:
>>
>> For our solution we are doing some aggregation on the server via
>> coprocessors. In general, for each row there are 8 columns: 7 columns
>> that contain numbers (for summation) and 1 column that contains a
>> hyperloglog counter (about 700 bytes). Functionally, this solution
>> works well and ought to scale with the number of region servers.
>> However, the individual request performance leaves a little to be
>> desired. What we've seen is that to scan 40000 rows (aggregated into
>> 3000 rows) takes about 4 seconds.
>>
>> Our code is in its early stages (unoptimized) so we hope to see some
>> significant performance improvements when we run our coprocessor under
>> a profiler. Our benchmarks were on underpowered machines (only 2GB
>> RAM) as well.
>>
>> Hope this helps!
>>
>> --Tom
>>
>> On Thu, May 3, 2012 at 6:08 AM, Pere Ferrera <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> Is anybody benchmarking the performance of server-side aggregations
>>> through co-processors in HBase? I am interested to know if HBase could
>>> potentially be used to calculate real-time SQL-like aggregations at a
>>> good level of performance (q < 200ms on high-load, big dataset
>>> scenario). Just curious to know before I implement my own benchmarks.
>>>
>>> Pere.
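
For readers who want to see the shape of the pattern James describes (client fans out one request per region, each region returns a single partial count, client totals them), here is a minimal Python sketch. It does not use any HBase API: regions are plain in-memory lists, and the names region_row_count, parallel_row_count, and split_into_regions are hypothetical stand-ins invented for illustration. It only shows the scatter-gather structure, not real coprocessor code or realistic timings.

```python
from concurrent.futures import ThreadPoolExecutor


def region_row_count(region_rows, row_filter=None):
    """Stand-in for the server-side step: reduce one region to a single count."""
    if row_filter is None:
        return len(region_rows)
    return sum(1 for row in region_rows if row_filter(row))


def parallel_row_count(regions, row_filter=None):
    """Client side: one task per region, then total the partial counts."""
    if not regions:
        return 0
    with ThreadPoolExecutor(max_workers=len(regions)) as pool:
        partials = pool.map(
            lambda rows: region_row_count(rows, row_filter), regions
        )
        return sum(partials)


def split_into_regions(rows, n):
    """Hypothetical helper: cut a row list into at most n contiguous chunks."""
    if not rows:
        return []
    size = -(-len(rows) // n)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


# Toy data: 1000 rows spread over 4 simulated regions.
rows = [{"key": i, "value": i % 10} for i in range(1000)]
regions = split_into_regions(rows, 4)

total = parallel_row_count(regions)
filtered = parallel_row_count(regions, lambda r: r["value"] == 3)
```

Each region hands back one number rather than its rows, so the data returned to the client grows with the region count and not the row count, which is the point of doing the aggregation server-side.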
