Wayne: In case you didn't see it, one of the lads had a suggestion over in https://issues.apache.org/jira/browse/HBASE-3421.
St.Ack

On Wed, Jan 5, 2011 at 1:33 PM, Stack <[email protected]> wrote:
> k. Thanks for stack trace. Its unrelated to compactions but that
> don't mean compaction not responsible.
>
> I'm filing an issue for wide rows causing OOME (HBASE-3421). Maybe
> its not the case but we should check it out at least because I know of
> others who have schemas where the row width can grow unbounded.
>
> Thanks,
> St.Ack
>
>
> On Wed, Jan 5, 2011 at 1:08 PM, Wayne <[email protected]> wrote:
>> Assuming it was big rows I dropped the table and have started loading it
>> again (I could not share the data anyway). We will see it again eventually
>> if it is not the large rows. We have changed our data model to be "taller"
>> so the 30 million column row is now gone for good.
>>
>> Below is the contents of the .out file from the first OOM. I found the hprof
>> files, but I have no public place to put them. At 10GB I can not even get
>> them to my laptop to look at either.
>>
>> Thanks.
>>
>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor2]
>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor3]
>> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor1]
>> [Unloading class sun.reflect.GeneratedConstructorAccessor10]
>> [Unloading class sun.reflect.GeneratedConstructorAccessor21]
>> [Unloading class sun.reflect.GeneratedConstructorAccessor23]
>> [Unloading class sun.reflect.GeneratedMethodAccessor17]
>> [Unloading class sun.reflect.GeneratedMethodAccessor1]
>> [Unloading class sun.reflect.GeneratedMethodAccessor19]
>> [Unloading class sun.reflect.GeneratedMethodAccessor18]
>> [Unloading class sun.reflect.GeneratedMethodAccessor16]
>> java.lang.OutOfMemoryError: Java heap space
>> Dumping heap to java_pid17874.hprof ...
>> Exception in thread "pool-1-thread-6" java.lang.OutOfMemoryError: Java heap space
>>     at org.apache.hadoop.hbase.client.Put.readFields(Put.java:500)
>> Heap dump file created [9689854890 bytes in 151.877 secs]
>>     at org.apache.hadoop.hbase.client.MultiPut.readFields(MultiPut.java:111)
>>     at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:461)
>>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invocation.readFields(HBaseRPC.java:124)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:959)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:927)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:503)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:297)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>     at java.lang.Thread.run(Unknown Source)
>> Exception in thread "ResponseProcessor for block blk_2176268114489978801_654636" java.lang.OutOfMemoryError: Java heap space
>>     at java.util.HashMap.newKeyIterator(Unknown Source)
>>     at java.util.HashMap$KeySet.iterator(Unknown Source)
>>     at java.util.HashSet.iterator(Unknown Source)
>>     at sun.nio.ch.SelectorImpl.processDeregisterQueue(Unknown Source)
>>     at sun.nio.ch.EPollSelectorImpl.doSelect(Unknown Source)
>>     at sun.nio.ch.SelectorImpl.lockAndDoSelect(Unknown Source)
>>     at sun.nio.ch.SelectorImpl.select(Unknown Source)
>>     at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
>>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>>     at java.io.DataInputStream.readFully(Unknown Source)
>>     at java.io.DataInputStream.readLong(Unknown Source)
>>     at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:122)
>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2547)
>> Exception in thread "pool-1-thread-7" java.lang.OutOfMemoryError: Java heap space
>>     at org.apache.hadoop.hbase.client.Put.readFields(Put.java:495)
>>     at org.apache.hadoop.hbase.client.MultiPut.readFields(MultiPut.java:111)
>>     at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:461)
>>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invocation.readFields(HBaseRPC.java:124)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:959)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:927)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:503)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:297)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>     at java.lang.Thread.run(Unknown Source)
>> Exception in thread "pool-1-thread-9" java.lang.OutOfMemoryError: Java heap space
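The trace shows the server running out of heap while deserializing a MultiPut, which is consistent with the oversized row Wayne describes. His fix, remodeling the data to be "taller", is the usual answer to this failure mode: move the unbounded dimension out of the column qualifiers and into the row key so no single row can grow to 30 million columns. A minimal sketch of that idea, assuming a 0.90-era HBase client API and made-up table, family, and key names (not anything from this thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class TallSchemaSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "readings");    // hypothetical table name
    byte[] fam = Bytes.toBytes("d");                // hypothetical column family

    String meter = "meter-0001";                    // hypothetical entity id
    long sampleTs = 1294250000000L;                 // hypothetical sample time (ms)

    // Wide design (the kind that produced the 30-million-column row):
    // one row per entity, one qualifier per sample, so the row grows without bound.
    //   Put wide = new Put(Bytes.toBytes(meter));
    //   wide.add(fam, Bytes.toBytes(sampleTs), Bytes.toBytes(42.0d));

    // Tall design: the sample time becomes part of the row key, so each row
    // holds only a few KeyValues and never has to fit a whole entity's
    // history in memory during a flush, scan, or compaction.
    Put tall = new Put(Bytes.toBytes(meter + "/" + sampleTs));
    tall.add(fam, Bytes.toBytes("v"), Bytes.toBytes(42.0d));
    table.put(tall);
    table.flushCommits();  // no-op with default autoFlush; explicit for clarity
  }
}

With keys like entity/timestamp the samples for one entity stay contiguous for scans, but each physical row stays small enough to serve and compact without ever materializing the whole logical record.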
>>
>>
>> On Wed, Jan 5, 2011 at 3:36 PM, Stack <[email protected]> wrote:
>>
>>> That could be it. Do you know the names of the participating files? They
>>> should be printed in the log. Can you save aside that Store, tar it
>>> up and put it somewhere we can pull (if the data is ok for others to see)?
>>> If we are OOME'ing on big rows, thats something I know some fellas
>>> for sure would be interested in fixing.
>>>
>>> (There is actually a 'limit' on the number of kvs to get at a time when
>>> scanning that is unused by the compaction code -- perhaps thats all it
>>> would take to fix this issue?)
>>>
>>> Its highly unlikely but I'll ask anyways, was the OOME in the same
>>> location both times? (The OOME stack trace may be in the .out file rather
>>> than the .log file.)
>>>
>>> Location of the hprof is usually where the program was launched from
>>> (check the $HBASE_HOME dir).
>>>
>>> St.Ack
>>>
>>>
>>> On Wed, Jan 5, 2011 at 11:24 AM, Wayne <[email protected]> wrote:
>>> > Pretty sure this is compaction. The same node OOMEd again, along with
>>> > another node, after starting compaction. Like Cassandra 0.6, I guess
>>> > hbase can not handle a row bigger than it can hold in memory. I always
>>> > read a lot about big cells being a problem, but this problem is big rows.
>>> >
>>> > Thanks.
>>> >
>>> > On Wed, Jan 5, 2011 at 12:13 PM, Wayne <[email protected]> wrote:
>>> >
>>> >> It was carrying ~9k writes/sec and has been for the last 24+ hours. There
>>> >> are 500+ regions on that node. I could not find the heap dump (location?)
>>> >> but we do have some errant big rows that have crashed before. When we
>>> >> query those big rows thrift has been crashing. Maybe major compaction
>>> >> kicked in for those rows (see the last log entry below)? There are 30
>>> >> million columns with all small cell values, but the 30 million is
>>> >> definitely too much.
>>> >>
>>> >> Here are some errors from the hadoop log. It looks like it kept getting
>>> >> stuck on something, which may point to the data being too big? The error
>>> >> below occurred 12 times in a row.
>>> >>
>>> >> org.apache.hadoop.ipc.RemoteException: java.io.IOException:
>>> >> blk_2176268114489978801_654636 is already commited, storedBlock == null.
>>> >>
>>> >> Here is the entry from the HBase log.
>>> >>
>>> >> 15:26:44,946 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
>>> >> java.io.IOException: Broken pipe
>>> >> 15:26:44,946 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
>>> >> Aborting region server serverName=sacdnb08.dataraker.net,60020,1294089592450,
>>> >> load=(requests=0, regions=552, usedHeap=7977, maxHeap=7987): Uncaught
>>> >> exception in service thread regionserver60020.compactor
>>> >> java.lang.OutOfMemoryError: Java heap space
>>> >>
>>> >> Thanks.
>>> >>
>>> >>
>>> >> On Wed, Jan 5, 2011 at 11:45 AM, Stack <[email protected]> wrote:
>>> >>
>>> >>> What was the server carrying? How many regions? What kind of
>>> >>> loading was on the cluster? We should not be OOME'ing. Do you have
>>> >>> the heap dump lying around? (We dump heap on OOME... its named *.hprof
>>> >>> or something. If you have it, want to put it somewhere for me to pull
>>> >>> it so I can take a look?) Any chance of errant big cells? Lots of
>>> >>> them? What JVM version?
>>> >>>
>>> >>> St.Ack
>>> >>>
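Stack's aside about a per-scan 'limit' on the number of kvs refers to the batching the client-side scan path already supports but the compaction scanner does not use; whether compaction can apply the same bound is what HBASE-3421 explores. As a client-side illustration only, assuming a 0.90-era API and hypothetical table and family names, a scan over a very wide row can be bounded like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedWideRowScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "readings");   // hypothetical table name

    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("d"));            // hypothetical column family
    scan.setCaching(1);                            // Results fetched per RPC
    scan.setBatch(10000);                          // max KeyValues returned per next()

    ResultScanner scanner = table.getScanner(scan);
    try {
      // A very wide row now arrives as a sequence of partial Results of at
      // most 10k KeyValues each, instead of one Result holding the whole row.
      for (Result partial : scanner) {
        // ... process partial.raw() ...
      }
    } finally {
      scanner.close();
    }
  }
}

Each next() call then hands back a bounded chunk, so even a 30-million-column row is streamed rather than materialized whole; giving the compaction path the same behavior is the open question in the JIRA.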
>>> >>> On Wed, Jan 5, 2011 at 8:10 AM, Wayne <[email protected]> wrote:
>>> >>> > I am still struggling with the JVM. We just had a hard OOM crash of a
>>> >>> > region server after only running for 36 hours. Any help would be
>>> >>> > greatly appreciated. Do we need to restart nodes every 24 hours under
>>> >>> > load? GC pauses are something we are trying to plan for, but full-out
>>> >>> > OOM crashes are a new problem.
>>> >>> >
>>> >>> > The message below seems to be where it starts going bad. It is
>>> >>> > followed by no less than 63 concurrent mode failure errors over a
>>> >>> > 16 minute period.
>>> >>> >
>>> >>> > *GC locker: Trying a full collection because scavenge failed*
>>> >>> >
>>> >>> > Lastly, here is the end (after the 63 CMF errors).
>>> >>> >
>>> >>> > Heap
>>> >>> >  par new generation   total 1887488K, used 303212K [0x00000005fae00000, 0x000000067ae00000, 0x000000067ae00000)
>>> >>> >   eden space 1677824K,  18% used [0x00000005fae00000, 0x000000060d61b078, 0x0000000661480000)
>>> >>> >   from space 209664K,   0% used [0x000000066e140000, 0x000000066e140000, 0x000000067ae00000)
>>> >>> >   to   space 209664K,   0% used [0x0000000661480000, 0x0000000661480000, 0x000000066e140000)
>>> >>> >  concurrent mark-sweep generation total 6291456K, used 2440155K [0x000000067ae00000, 0x00000007fae00000, 0x00000007fae00000)
>>> >>> >  concurrent-mark-sweep perm gen total 31704K, used 18999K [0x00000007fae00000, 0x00000007fccf6000, 0x0000000800000000)
>>> >>> >
>>> >>> > Here again are our custom settings in case there are some suggestions
>>> >>> > out there. Are we making it worse with these settings? What should we
>>> >>> > try next?
>>> >>> >
>>> >>> > -XX:+UseCMSInitiatingOccupancyOnly
>>> >>> > -XX:CMSInitiatingOccupancyFraction=60
>>> >>> > -XX:+CMSParallelRemarkEnabled
>>> >>> > -XX:SurvivorRatio=8
>>> >>> > -XX:NewRatio=3
>>> >>> > -XX:MaxTenuringThreshold=1
>>> >>> >
>>> >>> >
>>> >>> > Thanks!
>>> >>> >
>>> >>>
>>> >>>
>>> >>
>>> >>
>>> >
>>> >
>>
>
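For readers wondering how flags like the ones Wayne lists are applied, they normally go into hbase-env.sh via HBASE_OPTS. The sketch below is illustrative only: the heap size is inferred from the maxHeap=7987 figure in the region server log (roughly an 8 GB heap), and the log and heap-dump paths are assumptions, not anything from this thread; only the CMS tuning flags themselves come from Wayne's mail. Adding GC logging and an explicit heap-dump path at least makes the next concurrent mode failure or OOME easier to diagnose.

# hbase-env.sh -- illustrative only; heap size and paths are assumptions
export HBASE_OPTS="$HBASE_OPTS -Xmx8g -XX:+UseConcMarkSweepGC \
  -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:CMSInitiatingOccupancyFraction=60 \
  -XX:+CMSParallelRemarkEnabled \
  -XX:SurvivorRatio=8 \
  -XX:NewRatio=3 \
  -XX:MaxTenuringThreshold=1 \
  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/hbase \
  -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/hbase/gc-hbase.log"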
