Eran, see my response inline...
On May 10, 2012, at 2:17 PM, Eran Kutner wrote:

> Michael, I appreciate the feedback but I'd have to disagree.
> In my case for example, I need to look at a complete set of data produced
> by the map phase in order to make a decision and write it to HBase. So sure
> I could write all the mappers output to HBase then have another map only
> job to scan the output of the previous one do the calculation then write
> the output to another table. I don't really see why would that be better
> than using a reducer.

You disagree without actually benchmarking the two? That's pretty bold. :-)

Two things. First, reducers are expensive. Second, writing sorted records into HBase is also more expensive than writing records in random order.

Here's a caveat: I don't know what you're attempting to do, so I can only say that, in general, I've found it faster to write two mappers and avoid using reducers.

> As for the other tips, I agree the files are too large, so I increased the
> file size, but I don't really see why is that relevant to the error we're
> talking about. Why having many regions cause timeouts on HDFS?
> I do have mslabs configured and GC tuneups.
> I do run multiple reducers, I suspect that's aggravating the problem not
> helping it.
> As far as I can tell dfs.balance.bandwidthPerSec is relevant only for
> balancing done with the balancer, not for the initial replication.

With respect to the number of regions... you'd probably get a better answer from St.Ack or JD.

With respect to the bandwidth issue... We set it higher, to something like 10% of the available pipe. Not that it's going to be used all the time, but the smaller the pipe, the longer it takes to copy a file from one node to another. How much of an impact it has on your performance... Not sure. But it's always something to check and think about.

BTW, I did a quick read on your problem. You didn't say which release/version of HBase you were running....
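To put rough numbers on the "10% of the available pipe" remark, here is a minimal sketch of the arithmetic. The 1 Gbit/s link speed and the 1 GiB file size are assumptions chosen for illustration, not figures from this thread; the 1048576 bytes/sec value is the stock Hadoop 1.x default for dfs.balance.bandwidthPerSec as far as I know, so check your own hdfs-default.xml. Whether the setting matters outside the balancer is the open question above; this only shows how the cap translates into copy time.

```python
# Assumed values for illustration only (not taken from the thread).
LINK_BITS_PER_SEC = 1_000_000_000   # assumed 1 Gbit/s NIC
PERCENT_OF_PIPE = 10                # the "something like 10%" from the thread
DEFAULT_CAP = 1_048_576             # believed Hadoop 1.x default, bytes/sec
FILE_BYTES = 1 << 30                # assumed 1 GiB file


def bandwidth_cap(link_bits_per_sec: int, percent: int) -> int:
    """Bytes/sec corresponding to `percent` of the link (integer math)."""
    return link_bits_per_sec * percent // (8 * 100)


def copy_seconds(file_bytes: int, bytes_per_sec: int) -> float:
    """Lower bound on the time to move a file at the given cap."""
    return file_bytes / bytes_per_sec


if __name__ == "__main__":
    cap = bandwidth_cap(LINK_BITS_PER_SEC, PERCENT_OF_PIPE)
    print("cap at 10% of pipe (bytes/sec):", cap)
    print("seconds at default cap:", copy_seconds(FILE_BYTES, DEFAULT_CAP))
    print("seconds at 10% cap:", round(copy_seconds(FILE_BYTES, cap), 1))
```

Under these assumptions the same file that is throttled to roughly 17 minutes at the default cap moves in under a minute and a half at 10% of the link, which is the "smaller the pipe, the longer it takes" point in concrete terms.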
> -eran
>
> On Thu, May 10, 2012 at 9:59 PM, Michael Segel <michael_se...@hotmail.com> wrote:
>
>> Sigh.
>>
>> Dave,
>> I really think you need to think more about the problem.
>>
>> Think about what a reduce does and then think about what happens in side
>> of HBase.
>>
>> Then think about which runs faster... a job with two mappers writing the
>> intermediate and final results in HBase,
>> or a M/R job that writes its output to HBase.
>>
>> If you really truly think about the problem, you will start to understand
>> why I say you really don't want to use a reducer when you're working w
>> HBase.
>>
>> On May 10, 2012, at 1:41 PM, Dave Revell wrote:
>>
>>> Some examples of when you'd want a reducer:
>>> http://static.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf
>>>
>>> On Thu, May 10, 2012 at 11:30 AM, Michael Segel <michael_se...@hotmail.com> wrote:
>>>
>>>> Dave, do you really want to go there?
>>>>
>>>> OP has a couple of issues and he was going down a rabbit hole.
>>>> (You can choose if that's a reference to 'the Matrix, Jefferson Starship,
>>>> Alice in Wonderland... or all of the above)
>>>>
>>>> So to put him on the correct path, I recommended the following, not in any
>>>> order...
>>>>
>>>> 1) Increase his region size for this table only.
>>>> 2) Look to decreasing the number of regions managed by a RS (which is why
>>>> you increase region size)
>>>> 3) Up the dfs.balance.bandwidthPerSec. (How often does HBase move regions
>>>> and how exactly do they move regions ?)
>>>> 4) Look at implementing MSLABS and GC tuning. This cuts down on the
>>>> overhead.
>>>> 5) Refactoring his job....
>>>>
>>>> Oops.
>>>> Ok I didn't put that in the list.
>>>> But that was the last thing I wrote as a separate statement.
>>>> Clearly you didn't take my advice and think about the problem....
>>>>
>>>> To prove a point.... you wrote:
>>>> 'Many mapreduce algorithms require a reduce phase (e.g. sorting)'
>>>>
>>>> Ok. So tell me why you would want to sort your input in to HBase and if
>>>> that's really a good thing?
>>>> Oops!... :-)
>>>>
>>>> On May 10, 2012, at 12:31 PM, Dave Revell wrote:
>>>>
>>>>> This "you don't need a reducer" conversation is distracting from the real
>>>>> problem and is false.
>>>>>
>>>>> Many mapreduce algorithms require a reduce phase (e.g. sorting). The fact
>>>>> that the output is written to HBase or somewhere else is irrelevant.
>>>>>
>>>>> -Dave
>>>>>
>>>>> On Thu, May 10, 2012 at 6:26 AM, Michael Segel <michael_se...@hotmail.com> wrote:
>>>>> [SNIP]