I do think your JVM on the RS crashed. Do you have a GC log? And did you set mapred.map.tasks.speculative.execution=false when you use map jobs to read from or write to HBase?
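Not your code, obviously, just a minimal sketch of where that flag can go in a job driver; the class name, job name and the "..." parts below are placeholders, not taken from your setup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical driver for a map-only job that loads data into HBase.
public class HBaseLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Turn off speculative map tasks; commonly recommended so duplicate
    // (speculative) map attempts do not write the same data into HBase twice
    // and add extra load on the region servers. MRv1-style property name.
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);

    Job job = Job.getInstance(conf, "hbase-load"); // job name is a placeholder
    job.setMapSpeculativeExecution(false);         // same effect via the MapReduce API
    // ... set mapper, input format/path, TableOutputFormat and output table as in your real job ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}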
And if you have a heavy read/write load, how did you tune HBase (block cache size, compactions, memstore, and so on)? A rough sketch of what I mean is below, after the quoted thread.

On Fri, Jul 12, 2013 at 7:42 PM, David Koch <[email protected]> wrote:

> Thank you for your responses. With respect to the version of Java, I found
> that Cloudera recommend 1.7.x for CDH4.3:
> http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Requirements-and-Supported-Versions/cdhrsv_topic_3.html
>
> On Fri, Jul 12, 2013 at 1:32 PM, Jean-Marc Spaggiari <[email protected]> wrote:
>
> > Might want to run memtest also, just to be sure there is no memory issue.
> > It should not be one, since it was working fine with 0.92.4, but it costs nothing...
> >
> > The last version of Java 6 is 45... Might also be worth giving it a try if
> > you are running with 1.6.
> >
> > 2013/7/12 Asaf Mesika <[email protected]>
> >
> > > You need to see the JVM crash in the .out log file and see if maybe it's the .so
> > > native Hadoop code that is making the problem. In our case we
> > > downgraded from JVM 1.6.0-37 to 33 and it solved the issue.
> > >
> > > On Friday, July 12, 2013, David Koch wrote:
> > >
> > > > Hello,
> > > >
> > > > NOTE: I posted the same message in the Cloudera group.
> > > >
> > > > Since upgrading from CDH 4.0.1 (HBase 0.92.4) to 4.3.0 (HBase 0.94.6) we
> > > > systematically experience problems with region servers crashing silently
> > > > under workloads which used to pass without problems. More specifically, we
> > > > run about 30 Mapper jobs in parallel which read from HDFS and insert into
> > > > HBase.
> > > >
> > > > region server log
> > > > NOTE: no trace of crash, but server is down and shows up as such in
> > > > Cloudera Manager.
> > > >
> > > > 2013-07-12 10:22:12,050 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: File
> > > > hdfs://XXXXXXX:8020/hbase/.logs/XXXXXXX,60020,1373616547696-splitting/XXXXXXX%2C60020%2C1373616547696.1373617004286
> > > > might be still open, length is 0
> > > > 2013-07-12 10:22:12,051 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovering file
> > > > hdfs://XXXXXXX:8020/hbase/.logs/XXXXXXX,60020,1373616547696-splitting/XXXXXXXt%2C60020%2C1373616547696.1373617004286
> > > > 2013-07-12 10:22:13,064 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Finished lease recover attempt for
> > > > hdfs://XXXXXXX:8020/hbase/.logs/XXXXXXX,60020,1373616547696-splitting/XXXXXXX%2C60020%2C1373616547696.1373617004286
> > > > 2013-07-12 10:22:14,819 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.deflate]
> > > > 2013-07-12 10:22:14,824 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.deflate]
> > > > ...
> > > > 2013-07-12 10:22:14,850 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.deflate]
> > > > 2013-07-12 10:22:15,530 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.deflate]
> > > > <-- last log entry, region server is down here -->
> > > >
> > > > datanode log, same machine
> > > >
> > > > 2013-07-12 10:22:04,811 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: XXXXXXX:50010:DataXceiver error processing WRITE_BLOCK operation src: /YYY.YY.YYY.YY:36024 dest: /XXX.XX.XXX.XX:50010
> > > > java.io.IOException: Premature EOF from inputStream
> > > >     at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
> > > >     at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
> > > >     at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
> > > >     at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
> > > >     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
> > > >     at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:635)
> > > >     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:564)
> > > >     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:103)
> > > >     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:67)
> > > >     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
> > > >     at java.lang.Thread.run(Thread.java:724)
> > > > <-- many repetitions of this -->
> > > >
> > > > What could have caused this difference in stability?
> > > >
> > > > We did not change any configuration settings with respect to the previous
> > > > CDH 4.0.1 setup. In particular, we left ulimit and
> > > > dfs.datanode.max.xcievers at 32k. If need be, I can provide more complete
> > > > log/configuration information.
> > > >
> > > > Thank you,
> > > >
> > > > /David
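On the tuning side (the sketch I promised above): something along the lines of the hbase-site.xml fragment below is what I had in mind. The values are only illustrative starting points around the 0.94 defaults, not a recommendation for your cluster:

<configuration>
  <!-- Fraction of the region server heap given to the block cache.
       Write-heavy clusters often lower this in favour of the memstores. -->
  <property>
    <name>hfile.block.cache.size</name>
    <value>0.25</value>
  </property>
  <!-- Upper bound on the fraction of heap all memstores may use before
       flushes are forced. -->
  <property>
    <name>hbase.regionserver.global.memstore.upperLimit</name>
    <value>0.4</value>
  </property>
  <!-- Per-region memstore size (bytes) at which a flush is triggered. -->
  <property>
    <name>hbase.hregion.memstore.flush.size</name>
    <value>134217728</value>
  </property>
  <!-- Number of store files in a store before writes to that region are
       blocked until compaction catches up (the 0.94 default is 7). -->
  <property>
    <name>hbase.hstore.blockingStoreFiles</name>
    <value>10</value>
  </property>
  <!-- Interval between automatic major compactions, in ms; some people set
       this to 0 and trigger major compactions manually off-peak. -->
  <property>
    <name>hbase.hregion.majorcompaction</name>
    <value>86400000</value>
  </property>
</configuration>

Under a write-heavy load the usual trade-off is more heap for the memstores and less for the block cache, while keeping compactions from falling behind so writes do not block; the right numbers depend on your heap size and access pattern.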
