Pretty sure this is compaction. The same node OOMEd again, along with another node, after compaction started. Like Cassandra 0.6, I guess HBase cannot handle a row bigger than it can hold in memory. I have always read a lot about big cells being a problem, but this problem is big rows.
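For illustration, here is a minimal sketch of the kind of schema change that keeps row width bounded: spread each logical row across a fixed number of bucketed row keys so no single physical row grows to 30 million columns. The table name, column family, bucket count, and class below are all hypothetical, and it assumes the 0.90-era HBase client API.

// Hypothetical sketch: prefix-free bucketing of one "wide" logical row over
// N physical rows so column count per row stays bounded.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BucketedWriter {
  private static final int BUCKETS = 16;   // caps columns per physical row

  public static void write(String logicalRow, long columnId, byte[] value)
      throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "metrics");          // hypothetical table
    // Derive the bucket from the column id so writes spread evenly.
    int bucket = (int) (columnId % BUCKETS);
    byte[] rowKey = Bytes.toBytes(logicalRow + "#" + bucket);
    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("d"), Bytes.toBytes(columnId), value);
    table.put(put);
    table.close();
  }
}

Reads would then fan out over the 16 bucket keys instead of pulling one enormous row.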
Thanks.

On Wed, Jan 5, 2011 at 12:13 PM, Wayne <[email protected]> wrote:

> It was carrying ~9k writes/sec and has been for the last 24+ hours. There
> are 500+ regions on that node. I could not find the heap dump (location?)
> but we do have some errant big rows that have crashed before. When we query
> those big rows thrift has been crashing. Maybe major compaction kicked in
> for those rows (see last log entry below)? There are 30 million columns with
> all small cell values, but the 30 million is definitely too much.
>
> Here are some errors from the hadoop log. It looks like it kept getting
> stuck on something, which may point to the data being too big? The error
> below occurred 12 times in a row.
>
> org.apache.hadoop.ipc.RemoteException: java.io.IOException:
> blk_2176268114489978801_654636 is already commited, storedBlock == null.
>
> Here is the entry from the HBase log.
>
> 15:26:44,946 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
> java.io.IOException: Broken pipe
> 15:26:44,946 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
> Aborting region server serverName=sacdnb08.dataraker.net,60020,1294089592450,
> load=(requests=0, regions=552, usedHeap=7977, maxHeap=7987): Uncaught
> exception in service thread regionserver60020.compactor
> java.lang.OutOfMemoryError: Java heap space
>
> Thanks.
>
>
> On Wed, Jan 5, 2011 at 11:45 AM, Stack <[email protected]> wrote:
>
>> What was the server carrying? How many regions? What kind of
>> loading was on the cluster? We should not be OOME'ing. Do you have
>> the heap dump lying around? (We dump heap on OOME... it's named *.hprof
>> or something. If you have it, want to put it somewhere for me to pull
>> it so I can take a look?) Any chance of errant big cells? Lots of
>> them? What JVM version?
>>
>> St.Ack
>>
>> On Wed, Jan 5, 2011 at 8:10 AM, Wayne <[email protected]> wrote:
>> > I am still struggling with the JVM. We just had a hard OOM crash of a
>> > region server after only running for 36 hours. Any help would be greatly
>> > appreciated. Do we need to restart nodes every 24 hours under load? GC
>> > pauses are something we are trying to plan for, but full-out OOM crashes
>> > are a new problem.
>> >
>> > The message below seems to be where it starts going bad. It is followed
>> > by no less than 63 Concurrent Mode Failure errors over a 16-minute period.
>> >
>> > *GC locker: Trying a full collection because scavenge failed*
>> >
>> > Lastly, here is the end (after the 63 CMF errors).
>> >
>> > Heap
>> >  par new generation   total 1887488K, used 303212K
>> >   [0x00000005fae00000, 0x000000067ae00000, 0x000000067ae00000)
>> >   eden space 1677824K,  18% used
>> >    [0x00000005fae00000, 0x000000060d61b078, 0x0000000661480000)
>> >   from space 209664K,   0% used
>> >    [0x000000066e140000, 0x000000066e140000, 0x000000067ae00000)
>> >   to   space 209664K,   0% used
>> >    [0x0000000661480000, 0x0000000661480000, 0x000000066e140000)
>> >  concurrent mark-sweep generation total 6291456K, used 2440155K
>> >   [0x000000067ae00000, 0x00000007fae00000, 0x00000007fae00000)
>> >  concurrent-mark-sweep perm gen total 31704K, used 18999K
>> >   [0x00000007fae00000, 0x00000007fccf6000, 0x0000000800000000)
>> >
>> > Here again are our custom settings in case there are some suggestions
>> > out there. Are we making it worse with these settings? What should we
>> > try next?
>> >
>> > -XX:+UseCMSInitiatingOccupancyOnly
>> > -XX:CMSInitiatingOccupancyFraction=60
>> > -XX:+CMSParallelRemarkEnabled
>> > -XX:SurvivorRatio=8
>> > -XX:NewRatio=3
>> > -XX:MaxTenuringThreshold=1
>> >
>> > Thanks!
>> >
>
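For reference, a sketch of how the custom settings quoted above would typically be carried in conf/hbase-env.sh via HBASE_OPTS, with heap-dump flags added so the *.hprof produced on an OOME lands in a predictable place. The explicit CMS flag and the dump path are assumptions here, not taken from the thread; adjust them to the actual setup.

# conf/hbase-env.sh (sketch; paths and CMS flag are examples)
export HBASE_OPTS="$HBASE_OPTS \
  -XX:+UseConcMarkSweepGC \
  -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:CMSInitiatingOccupancyFraction=60 \
  -XX:+CMSParallelRemarkEnabled \
  -XX:SurvivorRatio=8 \
  -XX:NewRatio=3 \
  -XX:MaxTenuringThreshold=1 \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/var/log/hbase"

These flags only shape when collections run and where the dump is written; if a single row genuinely cannot fit in the region server heap, the row-bucketing sketch earlier in the thread is the more direct fix.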
