This is a hang with the timeout set to 0, but on CDH2?
St.Ack
On Sun, Jul 18, 2010 at 1:36 PM, Luke Forehand <[email protected]> wrote:
> I experienced the hang on my second job attempt. I will be pastebinning
> stacktraces and logs of all three servers tonight. The datanode log of one
> of the servers is way bigger than the rest, and that's all the analysis I've
> done so far. Meeting with Cloudera on Monday; they'll probably want me to
> migrate to CDH3. Need to mow the lawn... I'll report back soon.
>
> -Luke
>
> On 7/16/10 6:34 PM, "Ryan Rawson" <[email protected]> wrote:
>
> According to Todd, there is some kind of weird thread coordination
> issue which is worked around by setting the timeout to 0, even though
> we aren't actually hitting any timeouts in the failure case.
>
> And it might have been fixed in CDH3. I haven't had a chance to run it
> yet, so I can't say.
>
> -ryan
>
> On Fri, Jul 16, 2010 at 3:32 PM, Stack <[email protected]> wrote:
>> So, it seems like you are bypassing the issue by having no timeout on
>> the socket. I would for sure be interested, though, if you still have
>> the issue on CDH3b2. Most folks will not be running with no socket
>> timeout.
>>
>> Thanks Luke.
>> St.Ack
>>
>>
>> On Fri, Jul 16, 2010 at 3:01 PM, Luke Forehand
>> <[email protected]> wrote:
>>> Using Ryan Rawson's suggested config tweaks, we have just completed a
>>> successful job run with a 15GB sequence file, no hang. I'm setting up to
>>> have multiple files processed this weekend with the new settings. :-) I
>>> believe making the DFS socket write timeout indefinite was the trick.
>>>
>>> I'll post my results on Monday. Thanks for the support thus far!
>>>
>>> -Luke
>>>
>>> On 7/15/10 10:17 PM, "Ryan Rawson" <[email protected]> wrote:
>>>
>>> I'm not seeing anything in that logfile: you are seeing compactions
>>> for various regions, but I'm not seeing flushes (which are typical
>>> during insert loads), and nothing else stands out. One thing we look
>>> for is the log message "Blocking updates", which indicates that a
>>> particular region has decided to hold up writes to keep from taking
>>> on too many inserts.
>>>
>>> Like I said, you could be seeing this on a different regionserver: if
>>> all the clients are blocked on one regionserver and can't get to the
>>> others, then most servers will look idle and only one will actually
>>> show anything interesting in its log.
>>>
>>> Can you check for this behaviour?
>>>
>>> Also, if you want to tweak the config with the values I pasted, that
>>> should help.
>>>
>>> On Thu, Jul 15, 2010 at 7:25 PM, Luke Forehand
>>> <[email protected]> wrote:
>>>> It looks like we are going straight from the default config, no explicit
>>>> setting of anything.
>>>>
>>>> On 7/15/10 9:03 PM, "Ryan Rawson" <[email protected]> wrote:
>>>>
>>>> In this case the regionserver isn't actually doing anything: all the
>>>> IPC handler threads are waiting on their queue handoff (which is how
>>>> they get sockets/work to do).
>>>>
>>>> Something elsewhere, perhaps? Check the logs of your jobs; there might
>>>> be something interesting there.
>>>>
>>>> One thing that frequently happens is that you overrun one regionserver
>>>> with edits and it isn't flushing fast enough, so it pauses updates and
>>>> all clients end up stuck on it.
>>>>
>>>> What was that config again?
>>>> I use these settings:
>>>>
>>>> <property>
>>>>   <name>hbase.hstore.blockingStoreFiles</name>
>>>>   <value>15</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>dfs.datanode.socket.write.timeout</name>
>>>>   <value>0</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>hbase.hregion.memstore.block.multiplier</name>
>>>>   <value>8</value>
>>>> </property>
>>>>
>>>> perhaps try these ones?
>>>>
>>>> -ryan
>>>
>>
>
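For anyone trying Ryan's values, here is a sketch of how they would typically be laid out in hbase-site.xml on the regionservers. The property names and values are exactly as pasted above; the placement and the explanatory comments are a gloss on what each setting does, not something confirmed in the thread, and the defaults mentioned are the usual 0.20-era ones:

<!-- hbase-site.xml (typical placement: regionservers; this is an assumption, not stated in the thread) -->
<configuration>

  <!-- Allow more StoreFiles to accumulate per Store before the region
       blocks updates while waiting for compactions to catch up
       (the era's default was much lower). -->
  <property>
    <name>hbase.hstore.blockingStoreFiles</name>
    <value>15</value>
  </property>

  <!-- Disable the DFS socket write timeout used by the DFSClient that the
       regionserver embeds; 0 means wait indefinitely rather than abort a
       slow pipeline write. -->
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>0</value>
  </property>

  <!-- Let a region's memstore grow to multiplier x
       hbase.hregion.memstore.flush.size before updates are blocked,
       instead of the default multiplier of 2. -->
  <property>
    <name>hbase.hregion.memstore.block.multiplier</name>
    <value>8</value>
  </property>

</configuration>

The regionservers need a restart to pick these up.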
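The dfs.datanode.socket.write.timeout property is also read on the datanode side of the write pipeline, so if hangs persist with only the HBase-side change, the same value can be set in hdfs-site.xml on the datanodes as well (again an assumption about this cluster's setup, not something the thread confirms), followed by a datanode restart:

<!-- hdfs-site.xml on the datanodes (assumed placement) -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>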
