Another thing: be careful about the CF/attributes you include in the Scan.  If you 
add a column family (scan.addFamily), it will pull *all* the attributes of 
that column family.  If you only care about a row count, pick only one very 
small attribute from the row.  
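A minimal sketch of what that narrowed scan might look like, assuming the 0.90-era client API; the table, family, and qualifier names here are placeholders, not from the thread:

```java
// Count rows while fetching as little data as possible:
// restrict the scan to one small column instead of a whole family,
// and raise caching so each RPC returns many rows (default is 1).
HTable table = new HTable(conf, "mytable");               // placeholder table name
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q")); // one small attribute, not addFamily
scan.setCaching(1000);                                    // rows fetched per RPC
ResultScanner scanner = table.getScanner(scan);
int count = 0;
try {
    for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
        ++count;
    }
} finally {
    scanner.close();
}
```

If no single column is guaranteed to exist in every row, a FirstKeyOnlyFilter on the scan (which returns only the first KeyValue of each row) is another way to keep the transferred data small.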


-----Original Message-----
From: Wojciech Langiewicz [mailto:[email protected]] 
Sent: Sunday, May 01, 2011 2:12 PM
To: [email protected]
Subject: Re: Row count without iterating over ResultScanner?

Yes, I was using the default caching; setting this value to a few thousand made a 
significant difference in performance. I'll experiment more with this option.

Right now I want to stay away from MR, mainly because of cluster warm-up time, 
and I want to get results in almost real time (a few seconds max).

Thanks for the tip on caching!

On 01.05.2011 19:55, Doug Meil wrote:
> What caching value are you using on the scan?  If you aren't setting this, 
> it's probably using the default - which is 1.  Which is slow.   
> http://hbase.apache.org/book.html#d379e3504
>
> Re:  "I would like to use HBase API, not MR job (because this cluster only 
> has HDFS and HBase installed)."
>
> For Very Large tables you want to start using an MR job for this.
>
>
> -----Original Message-----
> From: Wojciech Langiewicz [mailto:[email protected]]
> Sent: Sunday, May 01, 2011 9:44 AM
> To: [email protected]
> Subject: Row count without iterating over ResultScanner?
>
> Hi,
> I would like to know if there's a way to quickly count number of rows from 
> scan result?
> Right now I'm iterating over ResultScanner like this:
> int count = 0;
> for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
>       ++count;
> }
> But with the number of rows reaching millions, this takes a while.
> I tried to find something in the documentation, but I didn't find anything.
> I would like to use HBase API, not MR job (because this cluster only has HDFS 
> and HBase installed).
>
> Thanks for all help.
>
> --
> Wojciech Langiewicz