I did not implement accurate timing, but the current table being counted has been running for about 10 hours, and the log is estimating the map portion at 10%:

2013-09-20 23:40:24,099 INFO [main] Job : map 10% reduce 0%

So a loooong time. Like I mentioned, we potentially have billions, if not trillions, of rows. Thanks for the feedback on the approaches I mentioned; I was not sure whether they would have any effect overall. I will look further into coprocessors.

Thanks!
Birch
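For what it is worth, if I am reading the 0.94 API correctly, the coprocessor route would look roughly like the sketch below. It assumes the AggregateImplementation coprocessor is already registered on the tables (or globally via hbase.coprocessor.region.classes), which unfortunately is a configuration change, and the table and family names are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
import org.apache.hadoop.hbase.util.Bytes;

public class CoprocessorRowCount {
    public static long count(String tableName, String family) throws Throwable {
        Configuration conf = HBaseConfiguration.create();
        // Requires org.apache.hadoop.hbase.coprocessor.AggregateImplementation
        // to be loaded on the region servers (or set on the table itself).
        AggregationClient aggregationClient = new AggregationClient(conf);
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes(family)); // rowCount expects a single family
        // Each region counts its own rows server-side; no MapReduce job is launched.
        return aggregationClient.rowCount(Bytes.toBytes(tableName),
                new LongColumnInterpreter(), scan);
    }
}

It still touches every row in every region, but it skips the MapReduce job setup overhead and does the counting on the region servers rather than in map tasks.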
On Sep 20, 2013, at 4:58 PM, Vladimir Rodionov <[email protected]> wrote:

> How long does it take for the RowCounter job on your largest table to finish on your cluster?
>
> Just curious.
>
> On your options:
>
> 1. Probably not worth it - you may overload your cluster.
> 2. Not sure this one differs from 1. Looks the same to me, but more complex.
> 3. The same as 1 and 2.
>
> Counting rows in an efficient way can be done if you sacrifice some accuracy:
>
> http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
>
> Yeah, you will need coprocessors for that.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: [email protected]
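As a rough illustration of the approximate-counting idea from that link (this assumes the ClearSpring stream-lib HyperLogLog class and is not anything we run today): feed every row key into a HyperLogLog sketch and read off the estimated cardinality. Run from a client it still scans every row, so to actually save time the sketch would have to live inside a coprocessor or be maintained at write time, as Vladimir notes.

import com.clearspring.analytics.stream.cardinality.HyperLogLog;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;

public class ApproximateRowCount {
    public static long approximateCount(String tableName) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), tableName);
        try {
            Scan scan = new Scan();
            scan.setCaching(1000);
            scan.setCacheBlocks(false);               // do not pollute the block cache
            scan.setFilter(new FirstKeyOnlyFilter()); // only the first KV of each row
            HyperLogLog hll = new HyperLogLog(0.01);  // roughly 1% relative error
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    hll.offer(r.getRow()); // distinct row keys == total row count
                }
            } finally {
                scanner.close();
            }
            return hll.cardinality();
        } finally {
            table.close();
        }
    }
}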
> ________________________________________
> From: James Birchfield [[email protected]]
> Sent: Friday, September 20, 2013 3:50 PM
> To: [email protected]
> Subject: Re: HBase Table Row Count Optimization - A Solicitation For Help
>
> Hadoop 2.0.0-cdh4.3.1
>
> HBase 0.94.6-cdh4.3.1
>
> 110 servers, 0 dead, 238.2364 average load
>
> Some other info, not sure if it helps or not.
>
> Configured Capacity: 1295277834158080 (1.15 PB)
> Present Capacity: 1224692609430678 (1.09 PB)
> DFS Remaining: 624376503857152 (567.87 TB)
> DFS Used: 600316105573526 (545.98 TB)
> DFS Used%: 49.02%
> Under replicated blocks: 0
> Blocks with corrupt replicas: 1
> Missing blocks: 0
>
> It is hitting a production cluster, but I am not really sure how to calculate the load placed on the cluster.
>
> On Sep 20, 2013, at 3:19 PM, Ted Yu <[email protected]> wrote:
>
>> How many nodes do you have in your cluster?
>>
>> When counting rows, what other load would be placed on the cluster?
>>
>> What is the HBase version you're currently using / planning to use?
>>
>> Thanks
>>
>> On Fri, Sep 20, 2013 at 2:47 PM, James Birchfield <[email protected]> wrote:
>>
>>> After reading the documentation and scouring the mailing list archives, I understand there is no real support for fast row counting in HBase unless you build some sort of tracking logic into your code. In our case, we do not have such logic, and we have massive amounts of data already persisted. I am running into the issue of very long execution of the RowCounter MapReduce job against very large tables (multi-billion rows for many of them is our estimate). I understand why this issue exists and am slowly accepting it, but I am hoping I can solicit some possible ideas to help speed things up a little.
>>>
>>> My current task is to provide total row counts on about 600 tables, some extremely large, some not so much. Currently, I have a process that executes the MapReduce job in-process like so:
>>>
>>> Job job = RowCounter.createSubmittableJob(ConfigManager.getConfiguration(), new String[]{tableName});
>>> boolean waitForCompletion = job.waitForCompletion(true);
>>> Counters counters = job.getCounters();
>>> Counter rowCounter = counters.findCounter(hbaseadminconnection.Counters.ROWS);
>>> return rowCounter.getValue();
>>>
>>> At the moment, each MapReduce job is executed in serial order, counting one table at a time. For the current implementation of this whole process, as it stands right now, my rough timing calculations indicate that fully counting all the rows of these 600 tables will take anywhere between 11 and 22 days. This is not what I consider a desirable timeframe.
>>>
>>> I have considered three alternative approaches to speed things up.
>>>
>>> First, since the application is not heavily CPU bound, I could use a ThreadPool and execute multiple MapReduce jobs at the same time, each looking at a different table. I have never done this, so I am unsure whether it would cause any unanticipated side effects. (A rough sketch of this approach appears at the end of this message.)
>>>
>>> Second, I could distribute the processes. I could find as many machines as can successfully talk to the desired cluster, give each a subset of tables to work on, and then combine the results afterwards.
>>>
>>> Third, I could combine both of the above approaches and run a distributed set of multithreaded processes to execute the MapReduce jobs in parallel.
>>>
>>> Although it seems to have been asked and answered many times, I will ask once again. Without needing to change our current configuration or restart the clusters, is there a faster approach to obtaining row counts? FYI, my cache size for the Scan is set to 1000. I have experimented with different numbers, but nothing made a noticeable difference. Any advice or feedback would be greatly appreciated!
>>>
>>> Thanks,
>>> Birch
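Coming back to the first approach from my original message (a ThreadPool running several RowCounter jobs at once), a minimal sketch of what I have in mind is below. ConfigManager and the Counters enum are the same ones used in the snippet above, and the pool size is a made-up number that would need tuning against whatever load the cluster can tolerate:

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.hbase.mapreduce.RowCounter;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;

public class ParallelRowCount {
    public static Map<String, Long> countAll(List<String> tableNames, int concurrentJobs)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(concurrentJobs);
        Map<String, Future<Long>> futures = new LinkedHashMap<String, Future<Long>>();
        for (final String tableName : tableNames) {
            futures.put(tableName, pool.submit(new Callable<Long>() {
                public Long call() throws Exception {
                    // Same logic as the serial version: submit one RowCounter
                    // MapReduce job for this table and block until it finishes.
                    Job job = RowCounter.createSubmittableJob(
                            ConfigManager.getConfiguration(), new String[]{tableName});
                    if (!job.waitForCompletion(true)) {
                        throw new IllegalStateException("RowCounter failed for " + tableName);
                    }
                    Counter rows = job.getCounters()
                            .findCounter(hbaseadminconnection.Counters.ROWS);
                    return rows.getValue();
                }
            }));
        }
        Map<String, Long> counts = new LinkedHashMap<String, Long>();
        for (Map.Entry<String, Future<Long>> entry : futures.entrySet()) {
            counts.put(entry.getKey(), entry.getValue().get());
        }
        pool.shutdown();
        return counts;
    }
}

The client side mostly just waits on job completion, so even a handful of concurrent jobs should pull the 11 to 22 day estimate down considerably, at the cost of that many tables being scanned at once.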
