bq. FirstKeyFilter *should* be faster since it only grabs the first KV pair.
Minor correction: FirstKeyFilter above should be FirstKeyOnlyFilter.

On Fri, Sep 20, 2013 at 5:53 PM, James Birchfield <[email protected]> wrote:

> Thanks for the info.
>
> Right now the MapReduce Scan uses the FirstKeyOnlyFilter. From what I have
> read in the javadoc, FirstKeyFilter *should* be faster since it only grabs
> the first KV pair.
>
> I will play around with setting the caching size to a much higher number
> and see how it performs. I do not think I have too much wiggle room to
> modify our cluster configurations, but will see what I can do.
>
> Thanks!
>
> Birch
>
> On Sep 20, 2013, at 5:39 PM, Bryan Beaudreault <[email protected]> wrote:
>
>> If your cells are extremely small, try setting the caching even higher
>> than 10k. You want to strike a balance between MBs of response size and
>> number of calls needed, leaning towards larger response sizes as far as
>> your system can handle (account for RS max memory, and memory available
>> to your mappers).
>>
>> You could use the KeyOnlyFilter to further limit the sizes of responses
>> transferred.
>>
>> Another thing that may help would be increasing your block size. This
>> would speed up sequential reads but slow down random access. It would be
>> a matter of making the config change and then running a major compaction
>> to re-write the data.
>>
>> We constantly run multiple MR jobs (often on the order of tens) against
>> the same HBase cluster and don't often see issues. They are not full
>> table scans, but they do often overlap. I think it would be worth at
>> least attempting to run multiple jobs at once.
>>
>> On Fri, Sep 20, 2013 at 8:09 PM, James Birchfield <[email protected]> wrote:
>>
>>> I did not implement accurate timing, but the current table being counted
>>> has been running for about 10 hours, and the log is estimating the map
>>> portion at 10%:
>>>
>>>     2013-09-20 23:40:24,099 INFO [main] Job: map 10% reduce 0%
>>>
>>> So a loooong time. Like I mentioned, we have billions, if not trillions
>>> of rows potentially.
>>>
>>> Thanks for the feedback on the approaches I mentioned. I was not sure if
>>> they would have any effect overall.
>>>
>>> I will look further into coprocessors.
>>>
>>> Thanks!
>>> Birch
>>>
>>> On Sep 20, 2013, at 4:58 PM, Vladimir Rodionov <[email protected]> wrote:
>>>
>>>> How long does it take for the RowCounter job to finish on the largest
>>>> table on your cluster?
>>>>
>>>> Just curious.
>>>>
>>>> On your options:
>>>>
>>>> 1. Not worth it probably - you may overload your cluster.
>>>> 2. Not sure this one differs from 1. Looks the same to me but more
>>>> complex.
>>>> 3. The same as 1 and 2.
>>>>
>>>> Counting rows in an efficient way can be done if you sacrifice some
>>>> accuracy:
>>>>
>>>> http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html
>>>>
>>>> Yeah, you will need coprocessors for that.
>>>>
>>>> Best regards,
>>>> Vladimir Rodionov
>>>> Principal Platform Engineer
>>>> Carrier IQ, www.carrieriq.com
>>>> e-mail: [email protected]
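
For reference, here is a minimal sketch of a coprocessor-based route: an exact count through the AggregationClient bundled with HBase, rather than the approximate HyperLogLog approach the linked article describes. It assumes the AggregateImplementation coprocessor is already loaded on every region server (a config change plus restart, which may not satisfy the no-configuration-change constraint James states later in the thread), and it uses the 0.94-era byte[]-based signature; the table and column family names are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
    import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CoprocessorRowCount {
        public static void main(String[] args) throws Throwable {
            Configuration conf = HBaseConfiguration.create();
            AggregationClient aggregationClient = new AggregationClient(conf);

            // Restrict the scan to a single column family so each region only
            // has to touch one store while counting.
            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("cf"));       // placeholder family name

            long rows = aggregationClient.rowCount(
                    Bytes.toBytes("my_table"),          // placeholder table name
                    new LongColumnInterpreter(),
                    scan);
            System.out.println("rows = " + rows);
        }
    }

The counting runs region-by-region on the servers, so no row data crosses the network, but it is still a full scan of every region and will take correspondingly long on multi-billion-row tables.
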
>>>> ________________________________________
>>>> From: James Birchfield [[email protected]]
>>>> Sent: Friday, September 20, 2013 3:50 PM
>>>> To: [email protected]
>>>> Subject: Re: HBase Table Row Count Optimization - A Solicitation For Help
>>>>
>>>> Hadoop 2.0.0-cdh4.3.1
>>>> HBase 0.94.6-cdh4.3.1
>>>> 110 servers, 0 dead, 238.2364 average load
>>>>
>>>> Some other info, not sure if it helps or not.
>>>>
>>>> Configured Capacity: 1295277834158080 (1.15 PB)
>>>> Present Capacity: 1224692609430678 (1.09 PB)
>>>> DFS Remaining: 624376503857152 (567.87 TB)
>>>> DFS Used: 600316105573526 (545.98 TB)
>>>> DFS Used%: 49.02%
>>>> Under replicated blocks: 0
>>>> Blocks with corrupt replicas: 1
>>>> Missing blocks: 0
>>>>
>>>> It is hitting a production cluster, but I am not really sure how to
>>>> calculate the load placed on the cluster.
>>>>
>>>> On Sep 20, 2013, at 3:19 PM, Ted Yu <[email protected]> wrote:
>>>>
>>>>> How many nodes do you have in your cluster?
>>>>>
>>>>> When counting rows, what other load would be placed on the cluster?
>>>>>
>>>>> What is the HBase version you're currently using / planning to use?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Fri, Sep 20, 2013 at 2:47 PM, James Birchfield <[email protected]> wrote:
>>>>>
>>>>>> After reading the documentation and scouring the mailing list archives,
>>>>>> I understand there is no real support for fast row counting in HBase
>>>>>> unless you build some sort of tracking logic into your code. In our
>>>>>> case, we do not have such logic, and have massive amounts of data
>>>>>> already persisted. I am running into the issue of very long execution
>>>>>> of the RowCounter MapReduce job against very large tables
>>>>>> (multi-billion for many is our estimate). I understand why this issue
>>>>>> exists and am slowly accepting it, but I am hoping I can solicit some
>>>>>> possible ideas to help speed things up a little.
>>>>>>
>>>>>> My current task is to provide total row counts on about 600 tables,
>>>>>> some extremely large, some not so much. Currently, I have a process
>>>>>> that executes the MapReduce job in process like so:
>>>>>>
>>>>>>     Job job = RowCounter.createSubmittableJob(
>>>>>>             ConfigManager.getConfiguration(),
>>>>>>             new String[]{tableName});
>>>>>>     boolean waitForCompletion = job.waitForCompletion(true);
>>>>>>     Counters counters = job.getCounters();
>>>>>>     Counter rowCounter =
>>>>>>             counters.findCounter(hbaseadminconnection.Counters.ROWS);
>>>>>>     return rowCounter.getValue();
>>>>>>
>>>>>> At the moment, each MapReduce job is executed in serial order, so
>>>>>> counting one table at a time. For the current implementation of this
>>>>>> whole process, as it stands right now, my rough timing calculations
>>>>>> indicate that fully counting all the rows of these 600 tables will take
>>>>>> anywhere between 11 to 22 days. This is not what I consider a desirable
>>>>>> timeframe.
>>>>>>
>>>>>> I have considered three alternative approaches to speed things up.
>>>>>>
>>>>>> First, since the application is not heavily CPU bound, I could use a
>>>>>> ThreadPool and execute multiple MapReduce jobs at the same time,
>>>>>> looking at different tables.
>>>>>> I have never done this, so I am unsure if
>>>>>> this would cause any unanticipated side effects.
>>>>>>
>>>>>> Second, I could distribute the processes. I could find as many machines
>>>>>> that can successfully talk to the desired cluster properly, give them a
>>>>>> subset of tables to work on, and then combine the results post process.
>>>>>>
>>>>>> Third, I could combine both the above approaches and run a distributed
>>>>>> set of multithreaded processes to execute the MapReduce jobs in
>>>>>> parallel.
>>>>>>
>>>>>> Although it seems to have been asked and answered many times, I will
>>>>>> ask once again. Without the need to change our current configurations
>>>>>> or restart the clusters, is there a faster approach to obtain row
>>>>>> counts? FYI, my cache size for the Scan is set to 1000. I have
>>>>>> experimented with different numbers, but nothing made a noticeable
>>>>>> difference. Any advice or feedback would be greatly appreciated!
>>>>>>
>>>>>> Thanks,
>>>>>> Birch
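
For reference, a minimal sketch of James's first option (a bounded thread pool running several RowCounter jobs concurrently, one per table), with Bryan's scanner-caching suggestion applied through the job configuration. The pool size, the caching value, and the counter-group string are illustrative assumptions rather than details taken from the thread.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.mapreduce.RowCounter;
    import org.apache.hadoop.mapreduce.Counter;
    import org.apache.hadoop.mapreduce.Job;

    public class ParallelRowCounts {

        static long countTable(String tableName) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // One common way to raise scanner caching for the job's mappers;
            // 10000 is an example value, not a recommendation.
            conf.set("hbase.client.scanner.caching", "10000");

            Job job = RowCounter.createSubmittableJob(conf, new String[]{tableName});
            if (!job.waitForCompletion(true)) {
                throw new IllegalStateException("RowCounter failed for " + tableName);
            }
            // RowCounter reports its result in its mapper's ROWS counter. The
            // group name below is the usual fully-qualified enum name, but it
            // can differ between versions, so verify it against your jars.
            Counter rows = job.getCounters().findCounter(
                    "org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters",
                    "ROWS");
            return rows.getValue();
        }

        public static void main(String[] args) throws Exception {
            // A small pool keeps the number of concurrent full-table scans
            // bounded, per Vladimir's warning about overloading the cluster.
            ExecutorService pool = Executors.newFixedThreadPool(4);
            List<Future<Long>> results = new ArrayList<Future<Long>>();
            for (final String table : args) {
                results.add(pool.submit(new Callable<Long>() {
                    public Long call() throws Exception {
                        return countTable(table);
                    }
                }));
            }
            for (int i = 0; i < args.length; i++) {
                System.out.println(args[i] + "\t" + results.get(i).get());
            }
            pool.shutdown();
        }
    }

Each job still scans its whole table; the pool only overlaps the jobs, so the wall-clock gain is bounded by how much concurrent scan load the cluster can absorb.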

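
Finally, a sketch of the Scan settings the top of the thread is discussing: FirstKeyOnlyFilter so each row returns only its first KeyValue, block caching disabled so a full scan does not churn the block cache, and a large caching value. It is shown as a plain client-side count rather than a MapReduce job; the table name and caching value are placeholders, and the 0.94-era HTable API is assumed.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;

    public class ClientSideRowCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "my_table");   // placeholder table name
            try {
                Scan scan = new Scan();
                scan.setFilter(new FirstKeyOnlyFilter());  // only the first KV of each row
                scan.setCacheBlocks(false);                // full scans should not fill the block cache
                scan.setCaching(10000);                    // rows fetched per RPC; example value
                long count = 0;
                ResultScanner scanner = table.getScanner(scan);
                try {
                    for (Result result : scanner) {
                        count++;
                    }
                } finally {
                    scanner.close();
                }
                System.out.println("rows = " + count);
            } finally {
                table.close();
            }
        }
    }

This is roughly what the RowCounter mapper does per region, which is why the job is scan-bound: the filter and caching settings shrink RPC traffic, but every row still has to be read once.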