Re: Speeding up the row count

Vedad Kirlic Wed, 17 Apr 2013 11:52:55 -0700

Hi Omkar,

If you are not interested in occurrences of a specific column (e.g. name,
email ...), and just want the total number of rows (regardless of their
content, i.e. columns), you should avoid adding any columns to the Scan. In
that case the coprocessor implementation behind AggregationClient will add a
FirstKeyOnlyFilter to the Scan to avoid loading unnecessary columns, which
should result in some speed-up.


This is similar to what the hbase shell 'count' implementation does,
although the reduction in overhead is bigger in that case, since it minimizes
data transfer from the region server to the client (shell). With a
coprocessor, the data does not leave the region server, so most of the
improvement should come from avoiding loading unnecessary files. I'm not
sure how much this will apply to your particular case, given that the data
set per row seems to be rather small. Also, with AggregationClient you will
benefit if/when your tables span multiple regions, since the regions are
counted in parallel. Essentially, performance of this approach will 'degrade'
as your table gets bigger, but only up to the point when it splits, after
which it should be pretty constant. With this in mind, and given your type
of data, you might consider pre-splitting your tables.
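
To make the reasoning above a bit more concrete, here is a toy model in
plain Java. It uses no HBase classes at all: RegionCountSketch, countRegion
and countRows are made-up names, and a "region" is just a list of rows held
in memory. It only illustrates the shape of the idea, namely that each
region is counted where it lives, looking at one cell per row at most, and
that only the partial counts travel back to the caller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Toy model only: hypothetical names, no HBase API involved.
class RegionCountSketch {

    // Counts the rows of one "region". A row is a list of cell values.
    // Only the presence of a first cell is checked, which mimics what a
    // first-key-only scan does: the rest of the row is never touched.
    static long countRegion(List<List<String>> region) {
        long count = 0;
        for (List<String> row : region) {
            if (!row.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Counts every region in parallel and sums the partial counts.
    // Only one long per region is handed back to the caller.
    static long countRows(List<List<List<String>>> regions) {
        ExecutorService pool =
            Executors.newFixedThreadPool(Math.max(1, regions.size()));
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (List<List<String>> region : regions) {
                Callable<Long> task = () -> countRegion(region);
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } catch (Exception e) {
            throw new RuntimeException("counting failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Two "regions" holding 3 and 2 rows respectively.
        List<List<String>> region1 = Arrays.asList(
            Arrays.asList("alice", "alice@example.com"),
            Arrays.asList("bob", "bob@example.com"),
            Arrays.asList("carol"));
        List<List<String>> region2 = Arrays.asList(
            Arrays.asList("dave", "dave@example.com"),
            Arrays.asList("erin"));
        System.out.println(countRows(Arrays.asList(region1, region2)));
    }
}
```

In this model the wall-clock time is bounded by the largest single region
rather than by the whole table, which is the intuition behind the
pre-splitting suggestion. How closely a real cluster follows that is
something you would have to measure.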

DISCLAIMER: this is mostly theoretical, since I'm not an expert in HBase
internals :), so your best bet is to try it - I'm too lazy to verify the
impact myself ;)

Finally, if your case can tolerate eventual consistency between the counters
and the actual number of rows, you can, as already suggested, run the
RowCounter MapReduce job every once in a while, write the counter(s) back to
HBase, and read those whenever you need the number of rows.
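
A minimal sketch of that last idea, again in plain Java and without any
HBase classes. CachedRowCount and its methods are hypothetical names; the
LongSupplier stands in for the expensive full count (the periodic RowCounter
run), and the AtomicLong stands in for the counter written back to a table.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Toy model only: hypothetical names, no HBase API involved.
class CachedRowCount {

    // Last stored count; -1 means no count has been computed yet.
    private final AtomicLong cached = new AtomicLong(-1L);

    // Stand-in for the expensive full count (e.g. a periodic batch job).
    private final LongSupplier fullCount;

    CachedRowCount(LongSupplier fullCount) {
        this.fullCount = fullCount;
    }

    // Recomputes the count and stores it; meant to be called occasionally.
    void refresh() {
        cached.set(fullCount.getAsLong());
    }

    // Cheap read of the stored value; may lag behind the real row count.
    long get() {
        return cached.get();
    }

    public static void main(String[] args) {
        AtomicLong actualRows = new AtomicLong(100);
        CachedRowCount counter = new CachedRowCount(actualRows::get);

        counter.refresh();
        actualRows.addAndGet(5);           // rows added after the last refresh
        System.out.println(counter.get()); // stale value from the last refresh

        counter.refresh();
        System.out.println(counter.get()); // caught up after the next refresh
    }
}
```

Readers always get an answer immediately, at the price of it being as old
as the last refresh, so the refresh interval is the knob that trades
accuracy against load.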

Regards,
Vedad



--
View this message in context: 
http://apache-hbase.679495.n3.nabble.com/Speeding-up-the-row-count-tp4042378p4042415.html
Sent from the HBase User mailing list archive at Nabble.com.
