According to Todd, there is some kind of weird thread coordination issue which is worked around by setting the timeout to 0, even though we aren't actually hitting any timeouts in the failure case.
And it might have been fixed in cdh3. I haven't had a chance to run it yet so I can't say.

-ryan

On Fri, Jul 16, 2010 at 3:32 PM, Stack <[email protected]> wrote:
> So, it seems like you are by-passing the issue by having no timeout on
> the socket. Would be for sure interested though if you have the issue
> still on cdh3b2. Most folks will not be running with no socket
> timeout.
>
> Thanks Luke.
> St.Ack
>
>
> On Fri, Jul 16, 2010 at 3:01 PM, Luke Forehand
> <[email protected]> wrote:
>> Using Ryan Rawson's suggested config tweaks, we have just completed a
>> successful job run with a 15GB sequence file, no hang. I'm setting up to
>> have multiple files process this weekend with the new settings. :-) I
>> believe the dfs socket write timeout being indefinite was the trick.
>>
>> I'll post my results on Monday. Thanks for the support thus far!
>>
>> -Luke
>>
>> On 7/15/10 10:17 PM, "Ryan Rawson" <[email protected]> wrote:
>>
>> I'm not seeing anything in that logfile, you are seeing compactions
>> for various regions, but I'm not seeing flushes (typical during insert
>> loads) and nothing else. One thing we look to see is a log message
>> "Blocking updates" which indicates that a particular region has
>> decided it's holding up to prevent taking too many inserts.
>>
>> Like I said, you could be seeing this on a different regionserver, if
>> all the clients are blocked on 1 regionserver and can't get to the
>> others then most will look idle and only one will actually show
>> anything interesting in the log.
>>
>> Can you check for this behaviour?
>>
>> Also if you want to tweak the config with the values I pasted that should
>> help.
>>
>> On Thu, Jul 15, 2010 at 7:25 PM, Luke Forehand
>> <[email protected]> wrote:
>>> It looks like we are going straight from the default config, no explicit
>>> setting of anything.
>>>
>>> On 7/15/10 9:03 PM, "Ryan Rawson" <[email protected]> wrote:
>>>
>>> In this case the regionserver isn't actually doing anything - all the
>>> IPC thread handlers are waiting in their queue handoff thingy (how
>>> they get socket/work to do).
>>>
>>> Something elsewhere perhaps? Check the logs of your jobs, there might
>>> be something interesting there.
>>>
>>> One thing that frequently happens is you overrun 1 regionserver with
>>> edits and it isn't flushing fast enough, so it pauses updates and all
>>> clients end up stuck on it.
>>>
>>> What was that config again? I use these settings:
>>>
>>> <property>
>>>   <name>hbase.hstore.blockingStoreFiles</name>
>>>   <value>15</value>
>>> </property>
>>>
>>> <property>
>>>   <name>dfs.datanode.socket.write.timeout</name>
>>>   <value>0</value>
>>> </property>
>>>
>>> <property>
>>>   <name>hbase.hregion.memstore.block.multiplier</name>
>>>   <value>8</value>
>>> </property>
>>>
>>> perhaps try these ones?
>>>
>>> -ryan
>>
>
