So, just to clarify where I am at this point: I have learned that I was absolutely not taking advantage of the cluster the way I was doing it. Some quick tests running it the 'correct' way, from the command line using the built-in RowCounter MapReduce job, run orders of magnitude faster than what I have been seeing.
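For reference, the command-line form I mean is along these lines (the table name is just a placeholder):

    hbase org.apache.hadoop.hbase.mapreduce.RowCounter my_table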
So, my apologies for seeking help for a problem when I did not fully understand the technology and its proper use. However, I am very glad that this community was able to point this out and clue me in. For that I am very, very appreciative. I will rework my logic to use this technique, probably creating a customized RowCounter MapReduce implementation that can count multiple tables at once instead of having to issue 600 individual requests.

Thanks again!!!

Birch
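P.S. A rough sketch of the kind of driver I have in mind (illustrative only, not tested code): it runs the stock RowCounter job for a few tables at a time and collects the ROWS counters. The pool size and the counter group/name strings are assumptions, and error handling is omitted.

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.mapreduce.RowCounter;
    import org.apache.hadoop.mapreduce.Job;

    // Sketch: run the stock RowCounter job for a few tables at a time instead of
    // strictly one after another. Pool size and counter group/name are assumptions.
    public class MultiTableRowCount {

        public static Map<String, Long> countAll(List<String> tables) throws Exception {
            final Map<String, Long> counts = new ConcurrentHashMap<String, Long>();
            ExecutorService pool = Executors.newFixedThreadPool(5); // a handful of concurrent jobs
            for (final String table : tables) {
                pool.submit(new Callable<Void>() {
                    public Void call() throws Exception {
                        // HBaseConfiguration.create() picks up hbase-site.xml (and the
                        // mapred/hdfs *-site.xml files) from the classpath, so the job is
                        // submitted to the real cluster rather than the LocalJobRunner.
                        Configuration conf = HBaseConfiguration.create();
                        Job job = RowCounter.createSubmittableJob(conf, new String[] { table });
                        if (job.waitForCompletion(false)) {
                            long rows = job.getCounters().findCounter(
                                "org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters",
                                "ROWS").getValue();
                            counts.put(table, rows);
                        }
                        return null;
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(30, TimeUnit.DAYS); // generous upper bound for 600 tables
            return counts;
        }
    }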
On Sep 20, 2013, at 6:57 PM, James Birchfield <[email protected]> wrote:

> Yes, we have a fully set up cluster complete with all you pointed out. But I believe the situation, now that it has been pointed out to me in this thread and in your reply, is exactly as you and Lars say: I am running the MapReduce in process from a standalone Java process, and I believe it is not taking advantage of that infrastructure.
>
> So I will pull this all out of the process, and run it on the cluster using the example I have read about.
>
> It is most likely just my ignorance leading to the root cause of this problem. All the help is very much appreciated.
>
> Thanks!
> Birch
>
> On Sep 20, 2013, at 6:46 PM, Bryan Beaudreault <[email protected]> wrote:
>
>> I could be wrong, but based on the info in your most recent emails, and the logs therein as well, I believe you may be running this job as a single process.
>>
>> Do you actually have a full Hadoop setup running, with a jobtracker and tasktrackers? In the absence of proper configuration, the Hadoop code will simply launch a local, single-process job. The LocalJobRunner referenced in your logs points to that.
>>
>> If this is the case, you are likely only running a single mapper and reducer, or at most running a few mappers at once in threads in your local process. Either way, this would obviously greatly limit the throughput.
>>
>> If you have a full Hadoop setup, make sure the client (dev machine) you are running this job from has access to a mapred-site.xml and hdfs-site.xml configuration file, or at the very least set the mapred.job.tracker value manually in your job configuration before submitting.
>>
>> Let me know if I'm totally off base here.
>>
>> On Fri, Sep 20, 2013 at 9:34 PM, James Birchfield <[email protected]> wrote:
>>
>>> Excellent! Will do!
>>>
>>> Birch
>>> On Sep 20, 2013, at 6:32 PM, Ted Yu <[email protected]> wrote:
>>>
>>>> Please take a look at the javadoc for src/main/java/org/apache/hadoop/hbase/client/coprocessor/AggregationClient.java
>>>>
>>>> As long as the machine can reach your HBase cluster, you should be able to run AggregationClient and utilize the AggregateImplementation endpoint in the region servers.
>>>>
>>>> Cheers
>>>>
>>>> On Fri, Sep 20, 2013 at 6:26 PM, James Birchfield <[email protected]> wrote:
>>>>
>>>>> Thanks Ted.
>>>>>
>>>>> That was the direction I have been working towards as I am learning today. Much appreciation for all the replies to this thread.
>>>>>
>>>>> Whether I keep the MapReduce job or utilize the Aggregation coprocessor (which, it turns out, should be possible for me here), I need to make sure I am running the client in an efficient manner. Lars may have hit upon the core problem: I am not running the MapReduce job on the cluster, but rather from a standalone remote Java client executing the job in process. This may very well turn out to be the number one issue. I would love it if this turns out to be true. It would make this a great learning lesson for me as a relative newcomer to working with HBase, and potentially allow me to finish this initial task much quicker than I was thinking.
>>>>>
>>>>> So assuming the MapReduce jobs need to be run on the cluster instead of locally, does a coprocessor endpoint client need to be run the same way, or is it safe to run it on a remote machine since the work gets distributed out to the region servers? Just wondering if I would run into the same issues if what I said above holds true.
>>>>>
>>>>> Thanks!
>>>>> Birch
>>>>> On Sep 20, 2013, at 6:17 PM, Ted Yu <[email protected]> wrote:
>>>>>
>>>>>> In 0.94, we have AggregateImplementation, an endpoint coprocessor, which implements getRowNum().
>>>>>>
>>>>>> Example is in AggregationClient.java
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On Fri, Sep 20, 2013 at 6:09 PM, lars hofhansl <[email protected]> wrote:
>>>>>>
>>>>>>> From your numbers below you have about 26k regions, thus each region is about 545 TB / 26k = 20 GB. Good.
>>>>>>>
>>>>>>> How many mappers are you running? And just to rule out the obvious, the M/R is running on the cluster and not locally, right? (It will default to a local runner when it cannot use the M/R cluster.)
>>>>>>>
>>>>>>> Some back-of-the-envelope calculations tell me that, assuming 1GbE network cards, the best you can expect for 110 machines to map through this data is about 10h, so way faster than what you see: 545 TB / (110 * 1/8 GB/s) ~ 40 ks ~ 11 h.
>>>>>>>
>>>>>>> We should really add a rowcounting coprocessor to HBase and allow using it via M/R.
>>>>>>>
>>>>>>> -- Lars
>>>>>>>
>>>>>>> ________________________________
>>>>>>> From: James Birchfield <[email protected]>
>>>>>>> To: [email protected]
>>>>>>> Sent: Friday, September 20, 2013 5:09 PM
>>>>>>> Subject: Re: HBase Table Row Count Optimization - A Solicitation For Help
>>>>>>>
>>>>>>> I did not implement accurate timing, but the current table being counted has been running for about 10 hours, and the log is estimating the map portion at 10%:
>>>>>>>
>>>>>>> 2013-09-20 23:40:24,099 INFO [main] Job : map 10% reduce 0%
>>>>>>>
>>>>>>> So a loooong time. Like I mentioned, we have billions, if not trillions, of rows potentially.
>>>>>>>
>>>>>>> Thanks for the feedback on the approaches I mentioned. I was not sure if they would have any effect overall.
>>>>>>>
>>>>>>> I will look further into coprocessors.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Birch
>>>>>>> On Sep 20, 2013, at 4:58 PM, Vladimir Rodionov <[email protected]> wrote:
>>>>>>>
>>>>>>>> How long does the RowCounter job for the largest table take to finish on your cluster?
>>>>>>>>
>>>>>>>> Just curious.
>>>>>>>>
>>>>>>>> On your options:
>>>>>>>>
>>>>>>>> 1. Probably not worth it - you may overload your cluster.
>>>>>>>> 2. Not sure this one differs from 1. Looks the same to me, but more complex.
>>>>>>>> 3. The same as 1 and 2.
>>>>>>>>
>>>>>>>> Counting rows in an efficient way can be done if you sacrifice some accuracy:
>>>>>>>>
>>>>>>>> http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
>>>>>>>>
>>>>>>>> Yeah, you will need coprocessors for that.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Vladimir Rodionov
>>>>>>>> Principal Platform Engineer
>>>>>>>> Carrier IQ, www.carrieriq.com
>>>>>>>> e-mail: [email protected]
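A minimal sketch of the coprocessor-based count referred to above, assuming the region servers load AggregateImplementation (for example via hbase.coprocessor.region.classes) and using a placeholder column family ("cf"); as I understand it, the 0.94 AggregationClient expects the scan to name exactly one family:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
    import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch only: counts rows via the AggregateImplementation endpoint running
    // in the region servers, so no MapReduce job is involved. "cf" is a
    // placeholder column family name.
    public class CoprocessorRowCount {

        public static long countRows(String tableName) throws Throwable {
            Configuration conf = HBaseConfiguration.create();
            AggregationClient aggregationClient = new AggregationClient(conf);
            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("cf")); // 0.94 expects exactly one family on the scan
            return aggregationClient.rowCount(Bytes.toBytes(tableName),
                new LongColumnInterpreter(), scan);
        }
    }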
>>>>>>>> ________________________________________
>>>>>>>> From: James Birchfield [[email protected]]
>>>>>>>> Sent: Friday, September 20, 2013 3:50 PM
>>>>>>>> To: [email protected]
>>>>>>>> Subject: Re: HBase Table Row Count Optimization - A Solicitation For Help
>>>>>>>>
>>>>>>>> Hadoop 2.0.0-cdh4.3.1
>>>>>>>> HBase 0.94.6-cdh4.3.1
>>>>>>>> 110 servers, 0 dead, 238.2364 average load
>>>>>>>>
>>>>>>>> Some other info, not sure if it helps or not:
>>>>>>>>
>>>>>>>> Configured Capacity: 1295277834158080 (1.15 PB)
>>>>>>>> Present Capacity: 1224692609430678 (1.09 PB)
>>>>>>>> DFS Remaining: 624376503857152 (567.87 TB)
>>>>>>>> DFS Used: 600316105573526 (545.98 TB)
>>>>>>>> DFS Used%: 49.02%
>>>>>>>> Under replicated blocks: 0
>>>>>>>> Blocks with corrupt replicas: 1
>>>>>>>> Missing blocks: 0
>>>>>>>>
>>>>>>>> It is hitting a production cluster, but I am not really sure how to calculate the load placed on the cluster.
>>>>>>>> On Sep 20, 2013, at 3:19 PM, Ted Yu <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> How many nodes do you have in your cluster?
>>>>>>>>>
>>>>>>>>> When counting rows, what other load would be placed on the cluster?
>>>>>>>>>
>>>>>>>>> What is the HBase version you're currently using / planning to use?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> On Fri, Sep 20, 2013 at 2:47 PM, James Birchfield <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> After reading the documentation and scouring the mailing list archives, I understand there is no real support for fast row counting in HBase unless you build some sort of tracking logic into your code. In our case, we do not have such logic, and we have massive amounts of data already persisted. I am running into the issue of very long execution of the RowCounter MapReduce job against very large tables (multi-billion rows for many of them, by our estimate). I understand why this issue exists and am slowly accepting it, but I am hoping I can solicit some possible ideas to help speed things up a little.
>>>>>>>>>>
>>>>>>>>>> My current task is to provide total row counts on about 600 tables, some extremely large, some not so much. Currently, I have a process that executes the MapReduce job in process like so:
>>>>>>>>>>
>>>>>>>>>>     Job job = RowCounter.createSubmittableJob(
>>>>>>>>>>         ConfigManager.getConfiguration(), new String[]{tableName});
>>>>>>>>>>     boolean waitForCompletion = job.waitForCompletion(true);
>>>>>>>>>>     Counters counters = job.getCounters();
>>>>>>>>>>     Counter rowCounter = counters.findCounter(hbaseadminconnection.Counters.ROWS);
>>>>>>>>>>     return rowCounter.getValue();
>>>>>>>>>>
>>>>>>>>>> At the moment, each MapReduce job is executed in serial order, so counting one table at a time. For the current implementation of this whole process, as it stands right now, my rough timing calculations indicate that fully counting all the rows of these 600 tables will take anywhere between 11 and 22 days. This is not what I consider a desirable timeframe.
>>>>>>>>>>
>>>>>>>>>> I have considered three alternative approaches to speed things up.
>>>>>>>>>>
>>>>>>>>>> First, since the application is not heavily CPU bound, I could use a ThreadPool and execute multiple MapReduce jobs at the same time, looking at different tables. I have never done this, so I am unsure if it would cause any unanticipated side effects.
>>>>>>>>>>
>>>>>>>>>> Second, I could distribute the processes. I could find as many machines as can successfully talk to the desired cluster properly, give them a subset of tables to work on, and then combine the results post process.
>>>>>>>>>>
>>>>>>>>>> Third, I could combine both of the above approaches and run a distributed set of multithreaded processes to execute the MapReduce jobs in parallel.
>>>>>>>>>>
>>>>>>>>>> Although it seems to have been asked and answered many times, I will ask once again: without the need to change our current configurations or restart the clusters, is there a faster approach to obtaining row counts? FYI, my cache size for the Scan is set to 1000. I have experimented with different numbers, but nothing made a noticeable difference. Any advice or feedback would be greatly appreciated!
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Birch
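For reference, a sketch of the scan settings such a counting pass typically uses; the caching value mirrors the 1000 mentioned above, and FirstKeyOnlyFilter is the same filter the stock RowCounter applies so that only the first cell of each row is returned:

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;

    public class CountingScan {
        // Illustrative scan settings for a counting pass; the caching value is a placeholder.
        public static Scan build() {
            Scan scan = new Scan();
            scan.setCaching(1000);                    // rows fetched per RPC round trip
            scan.setCacheBlocks(false);               // a full table scan should not churn the block cache
            scan.setFilter(new FirstKeyOnlyFilter()); // only the first KeyValue of each row is needed to count it
            return scan;
        }
    }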