In the end we increased the heap allocation for the HBase region servers to 4GB
(from its default of 1GB) and it seems to work now.
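
For reference, this is roughly what the change boils down to in hbase-env.sh (a
sketch, assuming the heap is set there rather than through Cloudera Manager; the
variable is simply prepended to whatever options are already configured):

# raise the region server heap from the 1GB default to 4GB
export HBASE_REGIONSERVER_OPTS="-Xms4g -Xmx4g $HBASE_REGIONSERVER_OPTS"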


On Mon, Jul 15, 2013 at 1:28 PM, Jamal B <[email protected]> wrote:

> I believe that your workload after the upgrade caused the process to exceed
> its 1 GB memory allocation, and your JVM flag -XX:OnOutOfMemoryError=kill
> -9 %p worked as expected and killed it.  I would remove the kill hook, or
> at least write some sort of log entry to syslog before it kills the pid;
> otherwise there is no log entry to point back to when the pid abruptly dies,
> like in this case.
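>
> For example, something along these lines might work (an untested sketch; %p is
> the pid placeholder the JVM fills in, and the quoting may need adjusting when
> set in hbase-env.sh):
>
> -XX:OnOutOfMemoryError="logger -t hbase-regionserver 'OOM, killing %p'; kill -9 %p"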
>
> Also, someone please correct me if I'm wrong, but I thought that the
> hbase.hregion.max.filesize config property does not enforce the max size of
> a region, but only a max size before compaction is required.
>
>
> On Fri, Jul 12, 2013 at 12:15 PM, David Koch <[email protected]>
> wrote:
>
> > Hello,
> >
> > This is the command that is used to launch the region servers:
> >
> > /usr/java/jdk1.7.0_25/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx1000m
> > -Djava.net.preferIPv4Stack=true -Xmx1073741824 -XX:+UseParNewGC
> > -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled
> > -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
> > -Dhbase.log.dir=/var/log/hbase
> > -Dhbase.log.file=hbase-cmf-hbase1-REGIONSERVER-big-4.ezakus.net.log.out
> > -Dhbase.home.dir=/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hbase
> > -Dhbase.id.str= -Dhbase.root.logger=INFO,RFA -Djava.library.path=<... libs ...>
> >
> > so it seems garbage collection logging is not activated. I can try to
> > re-launch with the -verbose:gc flag.
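> > For example, something like this (just a sketch; the log path is arbitrary):
> >
> > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-regionserver.log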
> >
> > All HBase settings are left at their (CDH 4.3) defaults, for example:
> > hfile.block.cache.size=0.25
> > hbase.hregion.max.filesize=1GB
> >
> > except:
> > hbase.hregion.majorcompaction=0
> >
> > speculative execution is off.
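> >
> > Note that with hbase.hregion.majorcompaction=0 major compactions are not
> > triggered automatically; when needed they can be started by hand, e.g. from
> > the hbase shell (the table name is just a placeholder):
> >
> > hbase> major_compact 'my_table'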
> >
> > The only solution we have found so far is lowering the workload by running
> > fewer jobs in parallel.
> >
> > /David
> >
> >
> > On Fri, Jul 12, 2013 at 1:48 PM, Azuryy Yu <[email protected]> wrote:
> >
> > > I do think your JVM on the RS crashed. Do you have a GC log?
> > >
> > > Did you set mapred.map.tasks.speculative.execution=false when you use map
> > > jobs to read or write HBase?
> > >
> > > And if you have a heavy read/write load, how did you tune HBase, e.g. block
> > > cache size, compaction, memstore, etc.?
> > >
> > >
> > > On Fri, Jul 12, 2013 at 7:42 PM, David Koch <[email protected]>
> > wrote:
> > >
> > > > Thank you for your responses. With respect to the version of Java, I found
> > > > that Cloudera recommends 1.7.x for CDH 4.3:
> > > > http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Requirements-and-Supported-Versions/cdhrsv_topic_3.html
> > > >
> > > >
> > > > On Fri, Jul 12, 2013 at 1:32 PM, Jean-Marc Spaggiari <
> > > > [email protected]> wrote:
> > > >
> > > > > Might want to run memtest also, just to be sure there is no memory issue.
> > > > > It should not be one, since it was working fine with 0.92.4, but it costs
> > > > > nothing...
> > > > >
> > > > > The last version of Java 6 is update 45... Might also be worth giving it a
> > > > > try if you are running with 1.6.
> > > > >
> > > > > 2013/7/12 Asaf Mesika <[email protected]>
> > > > >
> > > > > > You need to look at the JVM crash in the .out log file and see if maybe
> > > > > > it's the .so native Hadoop code that is causing the problem. In our case
> > > > > > we downgraded from JVM 1.6.0-37 to 33 and it solved the issue.
> > > > > >
> > > > > >
> > > > > > On Friday, July 12, 2013, David Koch wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > NOTE: I posted the same message in the Cloudera group.
> > > > > > >
> > > > > > > Since upgrading from CDH 4.0.1 (HBase 0.92.4) to 4.3.0 (HBase 0.94.6) we
> > > > > > > systematically experience problems with region servers crashing silently
> > > > > > > under workloads which used to pass without problems. More specifically,
> > > > > > > we run about 30 mapper jobs in parallel which read from HDFS and insert
> > > > > > > into HBase.
> > > > > > >
> > > > > > > region server log
> > > > > > > NOTE: no trace of a crash, but the server is down and shows up as such
> > > > > > > in Cloudera Manager.
> > > > > > >
> > > > > > > 2013-07-12 10:22:12,050 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: File hdfs://XXXXXXX:8020/hbase/.logs/XXXXXXX,60020,1373616547696-splitting/XXXXXXX%2C60020%2C1373616547696.1373617004286 might be still open, length is 0
> > > > > > > 2013-07-12 10:22:12,051 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovering file hdfs://XXXXXXX:8020/hbase/.logs/XXXXXXX,60020,1373616547696-splitting/XXXXXXXt%2C60020%2C1373616547696.1373617004286
> > > > > > > 2013-07-12 10:22:13,064 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Finished lease recover attempt for hdfs://XXXXXXX:8020/hbase/.logs/XXXXXXX,60020,1373616547696-splitting/XXXXXXX%2C60020%2C1373616547696.1373617004286
> > > > > > > 2013-07-12 10:22:14,819 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.deflate]
> > > > > > > 2013-07-12 10:22:14,824 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.deflate]
> > > > > > > ...
> > > > > > > 2013-07-12 10:22:14,850 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.deflate]
> > > > > > > 2013-07-12 10:22:15,530 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.deflate]
> > > > > > > < -- last log entry, region server is down here -- >
> > > > > > >
> > > > > > >
> > > > > > > datanode log, same machine
> > > > > > >
> > > > > > > 2013-07-12 10:22:04,811 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: XXXXXXX:50010:DataXceiver error processing WRITE_BLOCK operation  src: /YYY.YY.YYY.YY:36024 dest: /XXX.XX.XXX.XX:50010
> > > > > > > java.io.IOException: Premature EOF from inputStream
> > > > > > > at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
> > > > > > > at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
> > > > > > > at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
> > > > > > > at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
> > > > > > > at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
> > > > > > > at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:635)
> > > > > > > at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:564)
> > > > > > > at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:103)
> > > > > > > at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:67)
> > > > > > > at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
> > > > > > > at java.lang.Thread.run(Thread.java:724)
> > > > > > > < -- many repetitions of this -- >
> > > > > > >
> > > > > > > What could have caused this difference in stability?
> > > > > > >
> > > > > > > We did not change any configuration settings with respect to the
> > > > > > > previous CDH 4.0.1 setup. In particular, we left ulimit and
> > > > > > > dfs.datanode.max.xcievers at 32k. If need be, I can provide more
> > > > > > > complete log/configuration information.
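> > > > > > >
> > > > > > > As a quick sanity check, the effective open-file limit of a running
> > > > > > > region server can be inspected with something like this (a sketch; it
> > > > > > > assumes a single HRegionServer process on the box):
> > > > > > >
> > > > > > > cat /proc/$(pgrep -f HRegionServer)/limits | grep 'open files'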
> > > > > > >
> > > > > > > Thank you,
> > > > > > >
> > > > > > > /David
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
