Here are pastebins of my stack traces and logs. Note my comment below these links.
regionserver 1 stack trace: http://pastebin.com/0n9cDeYh
regionserver 2 stack trace: http://pastebin.com/8Sppp68h
regionserver 3 stack trace: http://pastebin.com/qzLEjBN0
regionserver 1 log (~5MB): http://pastebin.com/g3aB5L81
regionserver 2 log (~5MB): http://pastebin.com/NDEaUbJv
regionserver 3 log (~5MB): http://pastebin.com/SAVPnr7S
zookeeper 1, 2, 3 log: http://pastebin.com/33RPTHKX

So... am I seeing a deadlock occurring in the regionserver 2 stack trace?

"IPC Server handler 18 on 60020" - Thread t@65
   java.lang.Thread.State: WAITING on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@99de7de owned by: IPC Server handler 17 on 60020
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:747)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:778)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1114)
    at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:807)
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:953)
    at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:846)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:241)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushSomeRegions(MemStoreFlusher.java:352)
    - locked org.apache.hadoop.hbase.regionserver.MemStoreFlusher@4c2fe6bf
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.reclaimMemStoreMemory(MemStoreFlusher.java:321)
    - locked org.apache.hadoop.hbase.regionserver.MemStoreFlusher@4c2fe6bf
    at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1775)
    at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)

   Locked ownable synchronizers:
    - locked java.util.concurrent.locks.ReentrantLock$NonfairSync@5cd62cac
    - locked java.util.concurrent.locks.ReentrantLock$NonfairSync@3cf93af4

"IPC Server handler 17 on 60020" - Thread t@64
   java.lang.Thread.State: BLOCKED on java.util.HashMap@1e1b300f owned by: regionserver/192.168.200.32:60020.cacheFlusher
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.request(MemStoreFlusher.java:172)
    at org.apache.hadoop.hbase.regionserver.HRegion.requestFlush(HRegion.java:1524)
    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1509)
    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1292)
    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1255)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1781)
    at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)

   Locked ownable synchronizers:
    - locked java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@99de7de

"regionserver/192.168.200.32:60020.cacheFlusher" - Thread t@18
   java.lang.Thread.State: WAITING on java.util.concurrent.locks.ReentrantLock$NonfairSync@5cd62cac owned by: IPC Server handler 18 on 60020
    at sun.misc.Unsafe.park(Native Method)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:747)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:778)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1114)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:186)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:262)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:235)
    - locked java.util.HashMap@1e1b300f
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:149)

   Locked ownable synchronizers:
    - None
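If I'm reading it right, the three threads chase each other's locks in a cycle: handler 18 holds the ReentrantLock @5cd62cac and waits for the write side of the ReentrantReadWriteLock @99de7de; handler 17 holds @99de7de and is blocked on the java.util.HashMap @1e1b300f monitor; the cacheFlusher holds the @1e1b300f monitor and waits for @5cd62cac, which handler 18 owns. To see the shape in isolation, here is a minimal, self-contained sketch; the class and lock names are mine, not HBase's, and it is a deterministic toy rather than a reproduction of HBase code, but it parks three threads in the same WAITING/BLOCKED triangle:

// Toy three-way deadlock mirroring the trace above: one ReentrantLock,
// one ReentrantReadWriteLock, and one plain monitor. All names are
// illustrative.
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ThreeWayDeadlock {
    static final ReentrantLock flushLock = new ReentrantLock();                     // plays @5cd62cac
    static final ReentrantReadWriteLock updatesLock = new ReentrantReadWriteLock(); // plays @99de7de
    static final Map<String, String> flushQueue = new HashMap<String, String>();    // plays @1e1b300f

    // Each thread takes its first lock, then everyone rendezvouses here
    // before reaching for the second lock, so the deadlock always forms.
    static final CountDownLatch allHoldFirstLock = new CountDownLatch(3);

    static void rendezvous() {
        allHoldFirstLock.countDown();
        try {
            allHoldFirstLock.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        new Thread(new Runnable() {
            public void run() {
                flushLock.lock();               // holds the flush lock...
                rendezvous();
                updatesLock.writeLock().lock(); // ...waits forever for the write lock
            }
        }, "handler-18").start();

        new Thread(new Runnable() {
            public void run() {
                updatesLock.readLock().lock();  // holds the read lock...
                rendezvous();
                synchronized (flushQueue) {     // ...blocks forever on the queue monitor
                }
            }
        }, "handler-17").start();

        new Thread(new Runnable() {
            public void run() {
                synchronized (flushQueue) {     // holds the queue monitor...
                    rendezvous();
                    flushLock.lock();           // ...waits forever for the flush lock
                }
            }
        }, "cacheFlusher").start();
        // The JVM never exits; a jstack dump of this process shows the
        // same WAITING/BLOCKED triangle as the trace above.
    }
}

One caveat worth knowing: because a read lock records no single exclusive owner, jstack's automatic deadlock report (and ThreadMXBean.findDeadlockedThreads()) will generally not flag the leg of the cycle that runs through the ReentrantReadWriteLock, so a hang like this often has to be diagnosed by eye, as above.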
On 7/16/10 6:34 PM, "Ryan Rawson" <[email protected]> wrote:

According to Todd, there is some kind of weird thread coordination
issue which is worked around by setting the timeout to 0, even though
we aren't actually hitting any timeouts in the failure case. And it
might have been fixed in cdh3. I haven't had a chance to run it yet, so I
can't say.

-ryan

On Fri, Jul 16, 2010 at 3:32 PM, Stack <[email protected]> wrote:
> So, it seems like you are bypassing the issue by having no timeout on
> the socket. Would for sure be interested though if you still have the
> issue on cdh3b2. Most folks will not be running with no socket
> timeout.
>
> Thanks Luke.
> St.Ack
>
> On Fri, Jul 16, 2010 at 3:01 PM, Luke Forehand
> <[email protected]> wrote:
>> Using Ryan Rawson's suggested config tweaks, we have just completed a
>> successful job run with a 15GB sequence file, no hang. I'm setting up to
>> have multiple files processed this weekend with the new settings. :-) I
>> believe the dfs socket write timeout being indefinite was the trick.
>>
>> I'll post my results on Monday. Thanks for the support thus far!
>>
>> -Luke
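For the archives: the tweak Luke refers to is the stock HDFS property dfs.datanode.socket.write.timeout, where a value of 0 means "never time out" (the default in Hadoop of this vintage was 8 minutes). It normally lives in hdfs-site.xml; as a sketch, if you would rather set it on the client Configuration in code (the property name is stock Hadoop, the helper class is mine):

// Sketch of the write-timeout workaround discussed in this thread.
import org.apache.hadoop.conf.Configuration;

public class NoDfsWriteTimeout {
    public static Configuration apply(Configuration conf) {
        // 0 disables the DFS socket write timeout entirely
        // (the default is 8 minutes, i.e. 480000 ms).
        conf.setInt("dfs.datanode.socket.write.timeout", 0);
        return conf;
    }
}

Note that 0 also removes the safety net that would normally unstick a genuinely wedged DFS write, so as Stack says above, it is worth re-testing on cdh3b2, where the underlying coordination issue may already be fixed.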
