In the end we increased the heap allocation for the HBase region servers to 4GB
(from its default of 1GB) and it seems to work now.
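
For reference, this is roughly what the change boils down to in hbase-env.sh (a
sketch, assuming the heap is set there rather than through Cloudera Manager; the
variable is simply prepended to whatever options are already configured):

# raise the region server heap from the 1GB default to 4GB
export HBASE_REGIONSERVER_OPTS="-Xms4g -Xmx4g $HBASE_REGIONSERVER_OPTS"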


On Mon, Jul 15, 2013 at 1:28 PM, Jamal B <[email protected]> wrote:

> I believe that your workload after the upgrade caused the process to exceed
> its 1 GB memory allocation, and your JVM flag -XX:OnOutOfMemoryError=kill
> -9 %p worked as expected and killed it.  I would remove the kill hook, or
> at least write some sort of log entry to syslog before it kills the pid;
> otherwise there is no log entry to point back to when the pid abruptly dies,
> like in this case.
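>
> For example, something along these lines might work (an untested sketch; %p is
> the pid placeholder the JVM fills in, and the quoting may need adjusting when
> set in hbase-env.sh):
>
> -XX:OnOutOfMemoryError="logger -t hbase-regionserver 'OOM, killing %p'; kill -9 %p"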
>
> Also, someone please correct me if I'm wrong, but I thought that the
> hbase.hregion.max.filesize config property does not enforce the max size of
> a region, but only a max size before compaction is required.
>
>
> On Fri, Jul 12, 2013 at 12:15 PM, David Koch <[email protected]>
> wrote:
>
> > Hello,
> >
> > This is the command that is used to launch the region servers:
> >
> > /usr/java/jdk1.7.0_25/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx1000m
> > -Djava.net.preferIPv4Stack=true -Xmx1073741824 -XX:+UseParNewGC
> > -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled
> > -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
> > -Dhbase.log.dir=/var/log/hbase
> > -Dhbase.log.file=hbase-cmf-hbase1-REGIONSERVER-big-4.ezakus.net.log.out
> > -Dhbase.home.dir=/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hbase
> > -Dhbase.id.str= -Dhbase.root.logger=INFO,RFA -Djava.library.path=<... libs ...>
> >
> > so it seems garbage collection logging is not activated. I can try to
> > re-launch with the -verbose:gc flag.
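> > For example, something like this (just a sketch; the log path is arbitrary):
> >
> > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-regionserver.log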
> >
> > All HBase settings are left at their (CDH 4.3) defaults, for example:
> > hfile.block.cache.size=0.25
> > hbase.hregion.max.filesize=1GB
> >
> > except:
> > hbase.hregion.majorcompaction=0
> >
> > speculative execution is off.
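> >
> > Note that with hbase.hregion.majorcompaction=0 major compactions are not
> > triggered automatically; when needed they can be started by hand, e.g. from
> > the hbase shell (the table name is just a placeholder):
> >
> > hbase> major_compact 'my_table'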
> >
> > The only solution we have found so far is lowering the workload by running
> > fewer jobs in parallel.
> >
> > /David
> >
> >
> > On Fri, Jul 12, 2013 at 1:48 PM, Azuryy Yu <[email protected]> wrote:
> >
> > > I do think your JVM on the RS crashed. Do you have a GC log?
> > >
> > > Did you set mapred.map.tasks.speculative.execution=false when you use map
> > > jobs to read or write HBase?
> > >
> > > And if you have a heavy read/write load, how did you tune HBase, e.g. block
> > > cache size, compaction, memstore, etc.?
> > >
> > >
> > > On Fri, Jul 12, 2013 at 7:42 PM, David Koch <[email protected]>
> > wrote:
> > >
> > > > Thank you for your responses. With respect to the version of Java, I found
> > > > that Cloudera recommends 1.7.x for CDH 4.3:
> > > > http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Requirements-and-Supported-Versions/cdhrsv_topic_3.html
> > > >
> > > >
> > > > On Fri, Jul 12, 2013 at 1:32 PM, Jean-Marc Spaggiari <
> > > > [email protected]> wrote:
> > > >
> > > > > Might want to run memtest also, just to be sure there is no memory issue.
> > > > > It should not be one, since it was working fine with 0.92.4, but it costs
> > > > > nothing...
> > > > >
> > > > > The last version of Java 6 is update 45... Might also be worth giving it a
> > > > > try if you are running with 1.6.
> > > > >
> > > > > 2013/7/12 Asaf Mesika <[email protected]>
> > > > >
> > > > > > You need to look at the JVM crash in the .out log file and see if maybe
> > > > > > it's the .so native Hadoop code that is causing the problem. In our case
> > > > > > we downgraded from JVM 1.6.0-37 to 33 and it solved the issue.
> > > > > >
> > > > > >
> > > > > > On Friday, July 12, 2013, David Koch wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > NOTE: I posted the same message in the Cloudera group.
> > > > > > >
> > > > > > > Since upgrading from CDH 4.0.1 (HBase 0.92.4) to 4.3.0 (HBase 0.94.6) we
> > > > > > > systematically experience problems with region servers crashing silently
> > > > > > > under workloads which used to pass without problems. More specifically,
> > > > > > > we run about 30 mapper jobs in parallel which read from HDFS and insert
> > > > > > > into HBase.
> > > > > > >
> > > > > > > region server log
> > > > > > > NOTE: no trace of a crash, but the server is down and shows up as such
> > > > > > > in Cloudera Manager.
> > > > > > >
> > > > > > > 2013-07-12 10:22:12,050 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: File hdfs://XXXXXXX:8020/hbase/.logs/XXXXXXX,60020,1373616547696-splitting/XXXXXXX%2C60020%2C1373616547696.1373617004286 might be still open, length is 0
> > > > > > > 2013-07-12 10:22:12,051 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovering file hdfs://XXXXXXX:8020/hbase/.logs/XXXXXXX,60020,1373616547696-splitting/XXXXXXXt%2C60020%2C1373616547696.1373617004286
> > > > > > > 2013-07-12 10:22:13,064 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Finished lease recover attempt for hdfs://XXXXXXX:8020/hbase/.logs/XXXXXXX,60020,1373616547696-splitting/XXXXXXX%2C60020%2C1373616547696.1373617004286
> > > > > > > 2013-07-12 10:22:14,819 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.deflate]
> > > > > > > 2013-07-12 10:22:14,824 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.deflate]
> > > > > > > ...
> > > > > > > 2013-07-12 10:22:14,850 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.deflate]
> > > > > > > 2013-07-12 10:22:15,530 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.deflate]
> > > > > > > < -- last log entry, region server is down here -- >
> > > > > > >
> > > > > > >
> > > > > > > datanode log, same machine
> > > > > > >
> > > > > > > 2013-07-12 10:22:04,811 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: XXXXXXX:50010:DataXceiver error processing WRITE_BLOCK operation  src: /YYY.YY.YYY.YY:36024 dest: /XXX.XX.XXX.XX:50010
> > > > > > > java.io.IOException: Premature EOF from inputStream
> > > > > > > at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
> > > > > > > at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
> > > > > > > at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
> > > > > > > at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
> > > > > > > at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
> > > > > > > at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:635)
> > > > > > > at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:564)
> > > > > > > at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:103)
> > > > > > > at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:67)
> > > > > > > at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
> > > > > > > at java.lang.Thread.run(Thread.java:724)
> > > > > > > < -- many repetitions of this -- >
> > > > > > >
> > > > > > > What could have caused this difference in stability?
> > > > > > >
> > > > > > > We did not change any configuration settings with respect to the
> > > > > > > previous CDH 4.0.1 setup. In particular, we left ulimit and
> > > > > > > dfs.datanode.max.xcievers at 32k. If need be, I can provide more
> > > > > > > complete log/configuration information.
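> > > > > > >
> > > > > > > As a quick sanity check, the effective open-file limit of a running
> > > > > > > region server can be inspected with something like this (a sketch; it
> > > > > > > assumes a single HRegionServer process on the box):
> > > > > > >
> > > > > > > cat /proc/$(pgrep -f HRegionServer)/limits | grep 'open files'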
> > > > > > >
> > > > > > > Thank you,
> > > > > > >
> > > > > > > /David
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
