Agree with your first statement.  I am in no way saying HBase is being used 
properly as a store.  I am only saying my task is to determine the row counts 
as accurately as possible for the data and setup we currently have.

I set the scan caching to 1000.  I tried 10000, but did not see much of a 
performance increase.
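
For reference, this is how I am applying the caching before building the job 
(a minimal sketch; ConfigManager is our own wrapper, and 
hbase.client.scanner.caching is the standard HBase client property):

        Configuration conf = ConfigManager.getConfiguration();
        // Rows fetched per RPC by each scanner; 1000 vs. 10000 made little
        // difference for us.
        conf.setInt("hbase.client.scanner.caching", 1000);
        Job job = RowCounter.createSubmittableJob(conf, new String[]{tableName});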

I will look further into coprocessors.  Since I am relatively new to the 
technology, can someone provide a quick answer to this?  Will using a 
coprocessor require me to change and restart our cluster?  I am assuming it 
is possibly a configuration thing?  If so, I will have to see if that is an 
option.  If the answer is no, great.  If yes, and it is an option for me, I 
will definitely take a look at this approach.
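
In the meantime, my untested understanding of the built-in aggregation 
endpoint is roughly the following, assuming the AggregateImplementation 
coprocessor is already loaded on the region servers ("cf" is just a 
placeholder family name):

        // Assumes hbase.coprocessor.region.classes lists
        // org.apache.hadoop.hbase.coprocessor.AggregateImplementation.
        AggregationClient aggClient = new AggregationClient(conf);
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("cf"));  // placeholder column family
        // rowCount() declares Throwable, so real code needs wrapping.
        long rows = aggClient.rowCount(
                Bytes.toBytes(tableName), new LongColumnInterpreter(), scan);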

Thanks!
Birch
On Sep 20, 2013, at 4:56 PM, lars hofhansl <[email protected]> wrote:

> Hi James,
> 
> do you need that many tables? "Table" in HBase should have been called 
> "KeySpace" instead. 600 is a lot.
> 
> But anyway... Did you enable scanner caching for your M/R job? (If you 
> didn't, every next() will be a round trip to the RegionServer and you end 
> up measuring your network's RTT.)
> Are you IO bound?
> 
> 
> Lastly, instead of doing it as M/R (which has to bring all the data back 
> to the mapper just to count the returned rows), you could use a 
> coprocessor, which does the counting on the server (or use Phoenix; search 
> back in the archives for an example that James Taylor gave for row 
> counting).
> 
> -- Lars
> 
> 
> 
> ________________________________
> From: James Birchfield <[email protected]>
> To: [email protected] 
> Sent: Friday, September 20, 2013 2:47 PM
> Subject: HBase Table Row Count Optimization - A Solicitation For Help
> 
> 
>     After reading the documentation and scouring the mailing list archives, 
> I understand there is no real support for fast row counting in HBase unless 
> you build some sort of tracking logic into your code.  In our case, we do 
> not have such logic, and we have massive amounts of data already persisted.  
> I am running into the issue of very long execution of the RowCounter 
> MapReduce job against very large tables (multi-billion rows for many of 
> them, by our estimate).  I understand why this issue exists and am slowly 
> accepting it, but I am hoping I can solicit some possible ideas to help 
> speed things up a little.
>     
>     My current task is to provide total row counts on about 600 tables, 
> some extremely large, some not so much.  Currently, I have a process that 
> executes the MapReduce job in-process like so:
>     
>             // Build the standard RowCounter M/R job for this table.
>             Job job = RowCounter.createSubmittableJob(
>                     ConfigManager.getConfiguration(), new String[]{tableName});
>             // Block until the job finishes, then read its ROWS counter.
>             job.waitForCompletion(true);
>             Counters counters = job.getCounters();
>             Counter rowCounter =
>                     counters.findCounter(hbaseadminconnection.Counters.ROWS);
>             return rowCounter.getValue();
>             
>     At the moment, each MapReduce job is executed in serial order, counting 
> one table at a time.  As the process stands right now, my rough timing 
> calculations indicate that fully counting all the rows of these 600 tables 
> will take anywhere from 11 to 22 days.  This is not what I consider a 
> desirable timeframe.
> 
>     I have considered three alternative approaches to speed things up.
> 
>     First, since the application is not heavily CPU bound, I could use a 
> ThreadPool and execute multiple MapReduce jobs at the same time looking at 
> different tables.  I have never done this, so I am unsure if this would cause 
> any unanticipated side effects.  
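> 
> In case it helps make the idea concrete, I am picturing something along 
> these lines (an untested sketch; the pool size is arbitrary, and the 
> per-table counting logic is the same as above):
> 
>             ExecutorService pool = Executors.newFixedThreadPool(4);
>             Map<String, Future<Long>> results =
>                     new HashMap<String, Future<Long>>();
>             for (final String table : tableNames) {
>                 results.put(table, pool.submit(new Callable<Long>() {
>                     public Long call() throws Exception {
>                         // One RowCounter job per thread, one table each.
>                         Job job = RowCounter.createSubmittableJob(
>                                 ConfigManager.getConfiguration(),
>                                 new String[]{table});
>                         job.waitForCompletion(true);
>                         return job.getCounters().findCounter(
>                                 hbaseadminconnection.Counters.ROWS).getValue();
>                     }
>                 }));
>             }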
> 
>     Second, I could distribute the processes.  I could find as many machines 
> that can successfully talk to the desired cluster properly, give them a 
> subset of tables to work on, and then combine the results post process.
> 
>     Third, I could combine both of the above approaches and run a 
> distributed set of multithreaded processes to execute the MapReduce jobs in 
> parallel.
> 
>     Although it seems to have been asked and answered many times, I will ask 
> once again.  Without the need to change our current configurations or restart 
> the clusters, is there a faster approach to obtain row counts?  FYI, my cache 
> size for the Scan is set to 1000.  I have experimented with different 
> numbers, but nothing made a noticeable difference.  Any advice or feedback 
> would be greatly appreciated!
> 
> Thanks,
> Birch
