How many nodes do you have in your cluster? When counting rows, what other load would be placed on the cluster?
What is the HBase version you're currently using / planning to use?

Thanks

On Fri, Sep 20, 2013 at 2:47 PM, James Birchfield <
[email protected]> wrote:

> After reading the documentation and scouring the mailing list archives,
> I understand there is no real support for fast row counting in HBase
> unless you build some sort of tracking logic into your code. In our case,
> we do not have such logic, and we have massive amounts of data already
> persisted. I am running into the issue of very long execution of the
> RowCounter MapReduce job against very large tables (multi-billion rows
> for many, by our estimate). I understand why this issue exists and am
> slowly accepting it, but I am hoping I can solicit some possible ideas to
> help speed things up a little.
>
> My current task is to provide total row counts on about 600 tables, some
> extremely large, some not so much. Currently, I have a process that
> executes the MapReduce job in process like so:
>
>     Job job = RowCounter.createSubmittableJob(
>         ConfigManager.getConfiguration(), new String[]{tableName});
>     boolean completed = job.waitForCompletion(true);
>     Counters counters = job.getCounters();
>     Counter rowCounter =
>         counters.findCounter(hbaseadminconnection.Counters.ROWS);
>     return rowCounter.getValue();
>
> At the moment, each MapReduce job is executed in serial order, counting
> one table at a time. For the current implementation of this whole
> process, as it stands right now, my rough timing calculations indicate
> that fully counting all the rows of these 600 tables will take anywhere
> between 11 and 22 days. This is not what I consider a desirable
> timeframe.
>
> I have considered three alternative approaches to speed things up.
>
> First, since the application is not heavily CPU bound, I could use a
> ThreadPool and execute multiple MapReduce jobs at the same time, each
> looking at a different table. I have never done this, so I am unsure
> whether it would cause any unanticipated side effects.
>
> Second, I could distribute the processes. I could find as many machines
> as can successfully talk to the desired cluster, give each a subset of
> tables to work on, and then combine the results afterwards.
>
> Third, I could combine both of the above approaches and run a
> distributed set of multithreaded processes executing the MapReduce jobs
> in parallel.
>
> Although it seems to have been asked and answered many times, I will ask
> once again: without the need to change our current configurations or
> restart the clusters, is there a faster approach to obtain row counts?
> FYI, my cache size for the Scan is set to 1000. I have experimented with
> different numbers, but nothing made a noticeable difference. Any advice
> or feedback would be greatly appreciated!
>
> Thanks,
> Birch
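On your first approach: each RowCounter job is submitted to the cluster
independently, so running several from one client is mostly a matter of
keeping the job setup thread safe. Below is a minimal sketch of that
thread-pool idea, not a tested implementation. It assumes
HBaseConfiguration.create() in place of your ConfigManager, a hypothetical
countAll() helper, and it looks the ROWS counter up by its enum class
name, which can differ between HBase versions (your own
hbaseadminconnection.Counters.ROWS lookup would work just as well):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.mapreduce.RowCounter;
    import org.apache.hadoop.mapreduce.Job;

    public class ParallelRowCounts {

      /** Hypothetical helper: counts each table on its own thread. */
      public static List<Long> countAll(List<String> tableNames, int poolSize)
          throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        List<Future<Long>> futures = new ArrayList<Future<Long>>();
        for (final String tableName : tableNames) {
          futures.add(pool.submit(new Callable<Long>() {
            public Long call() throws Exception {
              // Give each job its own Configuration; a shared mutable
              // Configuration is not safe across concurrent submissions.
              Configuration conf = HBaseConfiguration.create();
              Job job = RowCounter.createSubmittableJob(
                  conf, new String[] { tableName });
              if (!job.waitForCompletion(true)) {
                throw new RuntimeException("RowCounter failed for " + tableName);
              }
              // Same ROWS counter your snippet reads, looked up by group
              // and name so this compiles without your project's import.
              // The group string is the enum's class name and may vary by
              // HBase version -- verify against the version you run.
              return job.getCounters().findCounter(
                  "org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters",
                  "ROWS").getValue();
            }
          }));
        }
        List<Long> counts = new ArrayList<Long>();
        for (Future<Long> f : futures) {
          counts.add(f.get()); // propagates any per-table failure
        }
        pool.shutdown();
        return counts;
      }
    }

I would start with a small pool (2 or 3 concurrent jobs) and watch region
server load before going wider: each RowCounter is a full table scan, so
the contention is on cluster I/O rather than client CPU. The same caveat
applies to your distributed variants, since every participating machine
is ultimately scanning the same cluster.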
