Agree with your first statement. I am in no way saying HBase is being used properly as a store. I am only saying my task is to determine the row counts as accurately as possible for the data and setup we currently have.
I set the scan caching to 1000. I tried 10000, but did not see much of a performance increase.

I will look further into coprocessors. Since I am relatively new to the technology, can someone provide a quick answer to this: will using a coprocessor require me to change and restart our cluster? I am assuming it is possibly a configuration thing? If so, I will have to see if that is an option. If the answer is no, great. If yes, and it is an option for me, I will definitely take a look at this approach.
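From what I can tell from the reference guide, the AggregateImplementation endpoint can be loaded either cluster-wide via hbase.coprocessor.region.classes in hbase-site.xml (which I assume means restarting the region servers) or per table through the table descriptor, so maybe a full restart is avoidable; please correct me if I have that wrong. If I am reading the 0.94 client API correctly, the client side would be roughly the sketch below. I have not run this, so treat the class and method names as my assumptions rather than working code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
import org.apache.hadoop.hbase.util.Bytes;

public class CoprocessorRowCount {
    public static void main(String[] args) throws Throwable {
        Configuration conf = HBaseConfiguration.create();

        // Assumes org.apache.hadoop.hbase.coprocessor.AggregateImplementation
        // is already loaded for the table, either via
        // hbase.coprocessor.region.classes in hbase-site.xml or through the
        // table descriptor.
        AggregationClient aggregationClient = new AggregationClient(conf);

        Scan scan = new Scan();
        scan.setCaching(1000); // same caching hint as the M/R scan
        // Depending on the version, the scan may also need a column family:
        // scan.addFamily(Bytes.toBytes("cf"));

        // The counting runs region-side; only per-region totals come back.
        long rows = aggregationClient.rowCount(
                Bytes.toBytes(args[0]), new LongColumnInterpreter(), scan);
        System.out.println(args[0] + ": " + rows + " rows");
    }
}

If that is accurate, the appeal over the M/R job is that no row data crosses the network at all, just the per-region counts.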
Thanks!
Birch

On Sep 20, 2013, at 4:56 PM, lars hofhansl <[email protected]> wrote:

> Hi James,
>
> Do you need that many tables? "Table" in HBase should have been called
> "KeySpace" instead. 600 is a lot.
>
> But anyway... Did you enable scanner caching for your M/R job? (If you
> didn't, every next() will be a round trip to the RegionServer and you end
> up measuring your network's RTT.)
> Are you IO bound?
>
> Lastly, instead of doing it as M/R (which has to bring all the data back
> to the mapper just to count the returned rows), you could use a
> coprocessor, which does the counting on the server (or use Phoenix; search
> back in the archives for an example that James Taylor gave for row
> counting).
>
> -- Lars
>
> ________________________________
> From: James Birchfield <[email protected]>
> To: [email protected]
> Sent: Friday, September 20, 2013 2:47 PM
> Subject: HBase Table Row Count Optimization - A Solicitation For Help
>
> After reading the documentation and scouring the mailing list archives, I
> understand there is no real support for fast row counting in HBase unless
> you build some sort of tracking logic into your code. In our case, we do
> not have such logic, and we have massive amounts of data already
> persisted. I am running into the issue of very long execution of the
> RowCounter MapReduce job against very large tables (multi-billion rows for
> many, by our estimate). I understand why this issue exists and am slowly
> accepting it, but I am hoping I can solicit some possible ideas to help
> speed things up a little.
>
> My current task is to provide total row counts on about 600 tables, some
> extremely large, some not so much. Currently, I have a process that
> executes the MapReduce job in-process like so:
>
> Job job = RowCounter.createSubmittableJob(
>         ConfigManager.getConfiguration(), new String[]{tableName});
> boolean waitForCompletion = job.waitForCompletion(true);
> Counters counters = job.getCounters();
> Counter rowCounter = counters.findCounter(hbaseadminconnection.Counters.ROWS);
> return rowCounter.getValue();
>
> At the moment, each MapReduce job is executed in serial order, so I am
> counting one table at a time. For the current implementation of this whole
> process, as it stands right now, my rough timing calculations indicate
> that fully counting all the rows of these 600 tables will take anywhere
> between 11 and 22 days. This is not what I consider a desirable timeframe.
>
> I have considered three alternative approaches to speed things up.
>
> First, since the application is not heavily CPU bound, I could use a
> ThreadPool and execute multiple MapReduce jobs at the same time, looking
> at different tables. I have never done this, so I am unsure if it would
> cause any unanticipated side effects.
>
> Second, I could distribute the processes. I could find as many machines as
> can successfully talk to the desired cluster, give them a subset of tables
> to work on, and then combine the results post process.
>
> Third, I could combine both of the above approaches and run a distributed
> set of multithreaded processes to execute the MapReduce jobs in parallel.
>
> Although it seems to have been asked and answered many times, I will ask
> once again: without the need to change our current configurations or
> restart the clusters, is there a faster approach to obtain row counts?
> FYI, my cache size for the Scan is set to 1000. I have experimented with
> different numbers, but nothing made a noticeable difference. Any advice or
> feedback would be greatly appreciated!
>
> Thanks,
> Birch
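For the thread pool idea in my original note above, something like the following is what I had in mind. It is only a sketch I have not run: the pool size of 4 is an arbitrary guess, and ConfigManager and hbaseadminconnection.Counters.ROWS are the same in-house helpers referenced in the snippet above, not shown here.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.hbase.mapreduce.RowCounter;
import org.apache.hadoop.mapreduce.Job;

public class ParallelRowCounts {
    public static void main(String[] args) throws Exception {
        // A handful of concurrent jobs; the real limit is the cluster's map
        // capacity and region server load, not local CPU.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Long>> pending = new ArrayList<Future<Long>>();

        for (final String tableName : args) {
            pending.add(pool.submit(new Callable<Long>() {
                @Override
                public Long call() throws Exception {
                    // Same call as the serial version; each submission gets
                    // its own Job instance so the jobs can run concurrently.
                    Job job = RowCounter.createSubmittableJob(
                            ConfigManager.getConfiguration(),
                            new String[] { tableName });
                    job.waitForCompletion(true);
                    return job.getCounters()
                            .findCounter(hbaseadminconnection.Counters.ROWS)
                            .getValue();
                }
            }));
        }

        for (int i = 0; i < args.length; i++) {
            System.out.println(args[i] + ": " + pending.get(i).get() + " rows");
        }
        pool.shutdown();
    }
}

Each Callable submits an independent RowCounter job, so the client threads mostly just wait on the cluster. If running several of these at once is going to hammer the region servers, I would rather hear that before I try it.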