Pretty sure this is compaction. The same node OOMEd again, along with another node, after compaction started. Like Cassandra 0.6, I guess HBase cannot handle a row bigger than it can hold in memory. I have always read a lot about big cells being a problem, but this problem is big rows.
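For illustration, here is a minimal sketch of the kind of schema change that keeps row width bounded: spread each logical row across a fixed number of bucketed row keys so no single physical row grows to 30 million columns. The table name, column family, bucket count, and class below are all hypothetical, and it assumes the 0.90-era HBase client API.

// Hypothetical sketch: prefix-free bucketing of one "wide" logical row over
// N physical rows so column count per row stays bounded.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BucketedWriter {
  private static final int BUCKETS = 16;   // caps columns per physical row

  public static void write(String logicalRow, long columnId, byte[] value)
      throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "metrics");          // hypothetical table
    // Derive the bucket from the column id so writes spread evenly.
    int bucket = (int) (columnId % BUCKETS);
    byte[] rowKey = Bytes.toBytes(logicalRow + "#" + bucket);
    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("d"), Bytes.toBytes(columnId), value);
    table.put(put);
    table.close();
  }
}

Reads would then fan out over the 16 bucket keys instead of pulling one enormous row.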
Thanks.

On Wed, Jan 5, 2011 at 12:13 PM, Wayne <[email protected]> wrote:

> It was carrying ~9k writes/sec and has been for the last 24+ hours. There
> are 500+ regions on that node. I could not find the heap dump (location?)
> but we do have some errant big rows that have crashed before. When we query
> those big rows thrift has been crashing. Maybe major compaction kicked in
> for those rows (see last log entry below)? There are 30 million columns with
> all small cell values, but the 30 million is definitely too much.
>
> Here are some errors from the hadoop log. It looks like it kept getting
> stuck on something, which may point to the data being too big? The error
> below occurred 12 times in a row.
>
> org.apache.hadoop.ipc.RemoteException: java.io.IOException:
> blk_2176268114489978801_654636 is already commited, storedBlock == null.
>
> Here is the entry from the HBase log.
>
> 15:26:44,946 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
> java.io.IOException: Broken pipe
> 15:26:44,946 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
> Aborting region server serverName=sacdnb08.dataraker.net,60020,1294089592450,
> load=(requests=0, regions=552, usedHeap=7977, maxHeap=7987): Uncaught
> exception in service thread regionserver60020.compactor
> java.lang.OutOfMemoryError: Java heap space
>
> Thanks.
>
>
> On Wed, Jan 5, 2011 at 11:45 AM, Stack <[email protected]> wrote:
>
>> What was the server carrying? How many regions? What kind of
>> loading was on the cluster? We should not be OOME'ing. Do you have
>> the heap dump lying around? (We dump heap on OOME... it's named *.hprof
>> or something. If you have it, want to put it somewhere for me to pull
>> it so I can take a look?) Any chance of errant big cells? Lots of
>> them? What JVM version?
>>
>> St.Ack
>>
>> On Wed, Jan 5, 2011 at 8:10 AM, Wayne <[email protected]> wrote:
>> > I am still struggling with the JVM. We just had a hard OOM crash of a
>> > region server after only running for 36 hours. Any help would be greatly
>> > appreciated. Do we need to restart nodes every 24 hours under load? GC
>> > pauses are something we are trying to plan for, but full-out OOM crashes
>> > are a new problem.
>> >
>> > The message below seems to be where it starts going bad. It is followed
>> > by no less than 63 Concurrent Mode Failure errors over a 16-minute period.
>> >
>> > *GC locker: Trying a full collection because scavenge failed*
>> >
>> > Lastly, here is the end (after the 63 CMF errors).
>> >
>> > Heap
>> >  par new generation   total 1887488K, used 303212K
>> >   [0x00000005fae00000, 0x000000067ae00000, 0x000000067ae00000)
>> >   eden space 1677824K,  18% used
>> >    [0x00000005fae00000, 0x000000060d61b078, 0x0000000661480000)
>> >   from space 209664K,   0% used
>> >    [0x000000066e140000, 0x000000066e140000, 0x000000067ae00000)
>> >   to   space 209664K,   0% used
>> >    [0x0000000661480000, 0x0000000661480000, 0x000000066e140000)
>> >  concurrent mark-sweep generation total 6291456K, used 2440155K
>> >   [0x000000067ae00000, 0x00000007fae00000, 0x00000007fae00000)
>> >  concurrent-mark-sweep perm gen total 31704K, used 18999K
>> >   [0x00000007fae00000, 0x00000007fccf6000, 0x0000000800000000)
>> >
>> > Here again are our custom settings in case there are some suggestions
>> > out there. Are we making it worse with these settings? What should we
>> > try next?
>> >
>> > -XX:+UseCMSInitiatingOccupancyOnly
>> > -XX:CMSInitiatingOccupancyFraction=60
>> > -XX:+CMSParallelRemarkEnabled
>> > -XX:SurvivorRatio=8
>> > -XX:NewRatio=3
>> > -XX:MaxTenuringThreshold=1
>> >
>> > Thanks!
>> >
>
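For reference, a sketch of how the custom settings quoted above would typically be carried in conf/hbase-env.sh via HBASE_OPTS, with heap-dump flags added so the *.hprof produced on an OOME lands in a predictable place. The explicit CMS flag and the dump path are assumptions here, not taken from the thread; adjust them to the actual setup.

# conf/hbase-env.sh (sketch; paths and CMS flag are examples)
export HBASE_OPTS="$HBASE_OPTS \
  -XX:+UseConcMarkSweepGC \
  -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:CMSInitiatingOccupancyFraction=60 \
  -XX:+CMSParallelRemarkEnabled \
  -XX:SurvivorRatio=8 \
  -XX:NewRatio=3 \
  -XX:MaxTenuringThreshold=1 \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/var/log/hbase"

These flags only shape when collections run and where the dump is written; if a single row genuinely cannot fit in the region server heap, the row-bucketing sketch earlier in the thread is the more direct fix.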
