The region server log snippet is from 09:11:04, while the data node log entry is from 00:02. Do you observe a similar warning around 09:11 in the data node log?
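If it helps, here is a rough sketch for pulling the data node entries around that time. The log path, the abort timestamp, and the standard log4j "yyyy-MM-dd HH:mm:ss,SSS" prefix are assumptions on my part, so adjust them for your installation:

#!/usr/bin/env python
# Rough sketch: print WARN/ERROR/FATAL lines from a Hadoop-style log that
# fall within a time window.  LOG, CENTER and the timestamp format are
# assumptions -- adjust for your installation.
import sys
from datetime import datetime, timedelta

LOG = "/var/log/hadoop/hadoop-hadoop-datanode.log"   # hypothetical path
CENTER = datetime(2014, 11, 26, 9, 11, 4)            # region server abort time
WINDOW = timedelta(minutes=5)

with open(LOG) as f:
    for line in f:
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        try:
            ts = datetime.strptime(parts[0] + " " + parts[1],
                                   "%Y-%m-%d %H:%M:%S,%f")
        except ValueError:
            continue    # continuation line, e.g. a stack trace frame
        level = parts[2].split(None, 1)[0]
        if abs(ts - CENTER) <= WINDOW and level in ("WARN", "ERROR", "FATAL"):
            sys.stdout.write(line)

Running the same scan against the region server log should make it easy to see whether the two line up.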
BTW, the 0.90 release is 3 major releases behind. Please consider upgrading.

Cheers

On Wed, Nov 26, 2014 at 1:43 PM, Adam Wilhelm <[email protected]> wrote:
> We are running an 80 node cluster:
> Hdfs version: 0.20.2-cdh3u5
> Hbase version: 0.90.6-cdh3u5
>
> The issue we have is that infrequently region servers are crashing. So far
> it has been once a week, not on the same day or time.
>
> The error we are getting in RegionServer logs is:
>
> 2014-11-26 09:11:04,460 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=hd073.xxxxxxxx,60020,1407311682582, load=(requests=0, regions=227, usedHeap=9293, maxHeap=12250): IOE in log roller
> java.io.IOException: cannot get log writer
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:677)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriterInstance(HLog.java:624)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.rollWriter(HLog.java:560)
>         at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:96)
> Caused by: java.io.IOException: java.io.IOException: Call to %NAMENODE%:8020 failed on local exception: java.io.IOException: Connection reset by peer
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:106)
>         at org.apache.hadoop.hbase.regionserver.wal.HLog.createWriter(HLog.java:674)
>         ... 3 more
> Caused by: java.io.IOException: Call to %NAMENODE%:8020 failed on local exception: java.io.IOException: Connection reset by peer
>         at org.apache.hadoop.ipc.Client.wrapException(Client.java:1187)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1155)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
>         at $Proxy7.create(Unknown Source)
>         at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>         at $Proxy7.create(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:3417)
>         at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:751)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.createNonRecursive(DistributedFileSystem.java:200)
>         at org.apache.hadoop.fs.FileSystem.createNonRecursive(FileSystem.java:653)
>         at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:444)
>         at sun.reflect.GeneratedMethodAccessor364.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.init(SequenceFileLogWriter.java:87)
>         ... 4 more
> Caused by: java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcher.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:175)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
>         at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>         at java.io.FilterInputStream.read(FilterInputStream.java:116)
>         at java.io.FilterInputStream.read(FilterInputStream.java:116)
>         at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:376)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>         at java.io.DataInputStream.readInt(DataInputStream.java:370)
>         at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:858)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:767)
> 2014-11-26 09:11:04,460 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Call to %NAMENODE%:8020 failed on local exception: java.io.IOException: Connection reset by peer
>         at org.apache.hadoop.ipc.Client.wrapException(Client.java:1187)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1155)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
>         at $Proxy7.addBlock(Unknown Source)
>         at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>         at $Proxy7.addBlock(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3719)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3586)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2400(DFSClient.java:2792)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2987)
>
> The servers aren't under any major load, but they appear to be having
> issues communicating with the namenode. There are what appear to be
> corresponding errors in the DataNode log. Those look like:
>
> 2014-11-26 00:02:15,423 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.100.2.76:50010, storageID=DS-562360767-10.100.2.76-50010-1358397869707, infoPort=50075, ipcPort=50020):Got exception while serving blk_-5442848061718769346_625833634 to /10.100.2.76:
> java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.100.2.76:50010 remote=/10.100.2.76:55462]
>         at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>         at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>         at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:397)
>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:279)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:175)
>
> What I am having trouble proving, and then making an educated guess at
> resolving, is whether this is an actual communication problem with the
> NameNode (i.e. an issue on that server), or whether the local write
> failures and timeouts point to local resource issues on the
> DataNode/RegionServer host.
>
> We are running RS, DN, and TT on each of the worker servers.
>
> Any insight or suggestions would be much appreciated.
>
> Thanks,
>
> Adam Wilhelm
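On the NameNode-vs-local question: the same scan, pointed at the NameNode log for the 09:11 window, may help (the path below is again an assumption):

LOG = "/var/log/hadoop/hadoop-hadoop-namenode.log"   # hypothetical NameNode log path

If the NameNode log is quiet around 09:11 while the colocated DataNode and RegionServer both report problems, that would suggest looking at the local host first; if the NameNode itself logs errors at that time, the problem is more likely on its side.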
