Hi James,

do you need that many tables? "Table" in HBase should have been called "KeySpace" 
instead. 600 is a lot.

But anyway... Did you enable scanner caching for your M/R job? If you didn't, 
every next() will be a round trip to the RegionServer and you end up measuring 
your network's RTT.
Are you IO bound?
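
If not, something like this on the Configuration you pass to 
RowCounter.createSubmittableJob() should do it (a minimal sketch; 1000 is just 
an example value):

            Configuration conf = ConfigManager.getConfiguration();
            // Ask for batches of rows per RPC instead of one row per next().
            // Depending on the HBase version, TableInputFormat may read its own
            // hbase.mapreduce.scan.cachedrows property instead, so set that too.
            conf.setInt("hbase.client.scanner.caching", 1000);
            conf.setInt("hbase.mapreduce.scan.cachedrows", 1000);
            Job job = RowCounter.createSubmittableJob(conf, new String[]{tableName});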


Lastly, instead of doing it as M/R (which has to bring all the data back to the 
mapper just to count the returned rows), you could use a coprocessor, which does 
the counting on the server (or use Phoenix; search back in the archives for an 
example that James Taylor gave for row counting).
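
For reference, a rough sketch of the coprocessor route (this assumes the 
AggregateImplementation coprocessor is loaded for the table, "myTable" is a 
placeholder, and the AggregationClient method signatures differ a bit between 
HBase versions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
    import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CoprocessorRowCount {
        public static void main(String[] args) throws Throwable {
            Configuration conf = HBaseConfiguration.create();
            AggregationClient aggClient = new AggregationClient(conf);
            // An empty Scan means a full-table count; the counting happens
            // region-side, so only one long per region comes back to the client.
            Scan scan = new Scan();
            long rows = aggClient.rowCount(
                    Bytes.toBytes("myTable"), new LongColumnInterpreter(), scan);
            System.out.println("rows: " + rows);
        }
    }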

-- Lars



________________________________
 From: James Birchfield <[email protected]>
To: [email protected] 
Sent: Friday, September 20, 2013 2:47 PM
Subject: HBase Table Row Count Optimization - A Solicitation For Help
 

    After reading the documentation and scouring the mailing list archives, I 
understand there is no real support for fast row counting in HBase unless you 
build some sort of tracking logic into your code.  In our case, we do not have 
such logic, and have massive amounts of data already persisted.  I am running 
into the issue of very long execution times for the RowCounter MapReduce job 
against very large tables (multi-billion rows for many of them, by our 
estimate).  I understand why 
this issue exists and am slowly accepting it, but I am hoping I can solicit 
some possible ideas to help speed things up a little.
    
    My current task is to provide total row counts on about 600 tables, some 
extremely large, some not so much.  Currently, I have a process that executes 
the MapReduce job in-process like so:
    
            // Build and run the stock RowCounter job for this table, then
            // read the ROWS counter off the finished job.
            Job job = RowCounter.createSubmittableJob(
                    ConfigManager.getConfiguration(), new String[]{tableName});
            boolean waitForCompletion = job.waitForCompletion(true);
            Counters counters = job.getCounters();
            Counter rowCounter =
                    counters.findCounter(hbaseadminconnection.Counters.ROWS);
            return rowCounter.getValue();
            
    At the moment, each MapReduce job is executed in serial order, so counting 
one table at a time.  For the current implementation of this whole process, as 
it stands right now, my rough timing calculations indicate that fully counting 
all the rows of these 600 tables will take anywhere between 11 to 22 days.  
This is not what I consider a desirable timeframe.

    I have considered three alternative approaches to speed things up.

    First, since the application is not heavily CPU bound, I could use a 
ThreadPool and execute multiple MapReduce jobs at the same time, each counting 
a different table.  I have never done this, so I am unsure if this would cause 
any unanticipated side effects.
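
Something along these lines is what I have in mind (a rough sketch only; the 
pool size is a guess, and countTable() just wraps the RowCounter snippet above):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.hadoop.hbase.mapreduce.RowCounter;
    import org.apache.hadoop.mapreduce.Job;

    public class ParallelRowCounts {

        // Wraps the in-process RowCounter call shown earlier for one table.
        static long countTable(String tableName) throws Exception {
            Job job = RowCounter.createSubmittableJob(
                    ConfigManager.getConfiguration(), new String[]{tableName});
            job.waitForCompletion(true);
            return job.getCounters()
                    .findCounter(hbaseadminconnection.Counters.ROWS).getValue();
        }

        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            List<Future<Long>> results = new ArrayList<Future<Long>>();
            for (final String tableName : args) {
                results.add(pool.submit(new Callable<Long>() {
                    public Long call() throws Exception {
                        return countTable(tableName);
                    }
                }));
            }
            long total = 0;
            for (Future<Long> f : results) {
                total += f.get(); // blocks until that table's job finishes
            }
            pool.shutdown();
            System.out.println("total rows: " + total);
        }
    }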

    Second, I could distribute the processes.  I could find as many machines 
as can successfully talk to the desired cluster, give each a subset of tables 
to work on, and then combine the results in a post-processing step.

    Third, I could combine both of the above approaches and run a distributed 
set of multithreaded processes to execute the MapReduce jobs in parallel.

    Although it seems to have been asked and answered many times, I will ask 
once again.  Without the need to change our current configurations or restart 
the clusters, is there a faster approach to obtain row counts?  FYI, my cache 
size for the Scan is set to 1000.  I have experimented with different numbers, 
but nothing made a noticeable difference.  Any advice or feedback would be 
greatly appreciated!

Thanks,
Birch
