Hi Omkar,

If you are not interested in occurrences of a specific column (e.g. name, email, ...) and just want the total number of rows, regardless of their content (i.e. columns), you should avoid adding any columns to the Scan. In that case the coprocessor implementation behind AggregationClient adds a FirstKeyOnlyFilter to the Scan so that unnecessary columns are not loaded, which should result in some speed-up.
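
To make the first-key-only idea concrete, here is a rough plain-Java model of it. It is not HBase code: the class RowCountSketch, its methods, and the sample table are all made up for illustration, and a sorted map stands in for a table. It only shows that a row count needs one cell per row, not all of them.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model only: no HBase classes are used and every name here is hypothetical.
class RowCountSketch {

    /** Outcome of one counting pass: rows seen and cells that had to be read. */
    static final class Result {
        final long rows;
        final long cellsRead;

        Result(long rows, long cellsRead) {
            this.rows = rows;
            this.cellsRead = cellsRead;
        }
    }

    /** Counts rows while reading every cell of every row. */
    static Result countReadingAllCells(Map<String, Map<String, String>> table) {
        long rows = 0;
        long cells = 0;
        for (Map<String, String> row : table.values()) {
            if (row.isEmpty()) {
                continue; // a row without cells does not exist
            }
            rows++;
            cells += row.size();
        }
        return new Result(rows, cells);
    }

    /** Counts rows while reading only the first cell of each row. */
    static Result countFirstKeyOnly(Map<String, Map<String, String>> table) {
        long rows = 0;
        long cells = 0;
        for (Map<String, String> row : table.values()) {
            if (row.isEmpty()) {
                continue; // a row without cells does not exist
            }
            rows++;
            cells++; // one cell is enough to know the row is there
        }
        return new Result(rows, cells);
    }

    /** Builds a tiny stand-in table: three rows with a name and an email column each. */
    static Map<String, Map<String, String>> sampleTable() {
        Map<String, Map<String, String>> table = new TreeMap<>();
        for (int i = 0; i < 3; i++) {
            Map<String, String> row = new TreeMap<>();
            row.put("info:email", "user" + i + "@example.com");
            row.put("info:name", "user" + i);
            table.put("row-" + i, row);
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> table = sampleTable();
        Result all = countReadingAllCells(table);
        Result first = countFirstKeyOnly(table);
        System.out.println("all cells:      rows=" + all.rows + " cellsRead=" + all.cellsRead);
        System.out.println("first key only: rows=" + first.rows + " cellsRead=" + first.cellsRead);
    }
}
```

Both passes report the same number of rows, but the second one touches half as many cells in this two-column sample. With only a couple of small columns per row the saving is modest, and real numbers will depend on your cluster and data.
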
This is similar to what the hbase shell 'count' implementation does, although the reduction in overhead is bigger there, since it minimizes data transfer from the region server to the client (shell). With the coprocessor, data does not leave the region server, so most of the improvement should come from avoiding loading of unnecessary files. I'm not sure how much this will apply to your particular case, given that the data set per row seems to be rather small.

Also, with AggregationClient you will benefit if/when your tables span multiple regions, because each region is counted separately and in parallel. Essentially, performance of this approach will 'degrade' as your table gets bigger, but only up to the point where it splits; from then on it should be pretty constant. With this in mind, and given your type of data, you might consider pre-splitting your tables.

DISCLAIMER: this is mostly theoretical, since I'm not an expert in hbase internals :), so your best bet is to try it - I'm too lazy to verify the impact myself ;)

Finally, if your case can tolerate eventual consistency between the counters and the actual number of rows, you can, as already suggested, run the RowCounter map reduce job every once in a while, write the counter(s) back to hbase, and read those when you need the number of rows.

Regards,
Vedad

--
View this message in context: http://apache-hbase.679495.n3.nabble.com/Speeding-up-the-row-count-tp4042378p4042415.html
Sent from the HBase User mailing list archive at Nabble.com.
