That could be it. Do you know the names of the participating files? They
should be printed in the log. Can you save aside that Store, tar it up and
put it somewhere we can pull it from (if the data is ok for others to see)?
If we are OOME'ing on big rows, that's something I know some fellas for sure
would be interested in fixing.
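For what it's worth, here is a minimal client-side sketch -- assuming the
0.90-era Java client API, with a made-up table name and row key -- of how a
reader can walk one of those very wide rows in bounded chunks via
Scan.setBatch, instead of pulling every KeyValue of the row into memory at
once:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class WideRowProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");    // hypothetical table name
    byte[] row = Bytes.toBytes("the-big-row");     // hypothetical row key

    // Scan just this one row (stop row is the key plus a trailing zero byte),
    // but hand back at most 10k KVs per Result so the client never holds the
    // whole multi-million-column row in memory at once.
    Scan scan = new Scan(row, Bytes.add(row, new byte[] { 0 }));
    scan.setBatch(10000);
    scan.setCaching(1);

    ResultScanner scanner = table.getScanner(scan);
    long kvs = 0;
    for (Result chunk : scanner) {
      kvs += chunk.size();                         // one batch of the row's columns per Result
    }
    scanner.close();
    table.close();
    System.out.println("KeyValues in row: " + kvs);
  }
}

That batching only helps client-side reads, of course; compaction goes
through the internal store scanners.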
(There is actually a 'limit' on the number of KVs to get at a time when
scanning that goes unused by the compaction code -- perhaps that's all it
would take to fix this issue?)

It's highly unlikely, but I'll ask anyway: was the OOME in the same location
both times? (The OOME stack trace may be in the .out file rather than the
.log file.) The location of the hprof is usually wherever the program was
launched from (check the $HBASE_HOME dir).

St.Ack

On Wed, Jan 5, 2011 at 11:24 AM, Wayne <[email protected]> wrote:
> Pretty sure this is compaction. The same node OOME'd again, along with
> another node, after starting compaction. Like Cassandra .6, I guess HBase
> can not handle a row bigger than it can hold in memory. I always read a
> lot about big cells being a problem, but this problem is big rows.
>
> Thanks.
>
> On Wed, Jan 5, 2011 at 12:13 PM, Wayne <[email protected]> wrote:
>
>> It was carrying ~9k writes/sec and has been for the last 24+ hours.
>> There are 500+ regions on that node. I could not find the heap dump
>> (location?), but we do have some errant big rows that have crashed
>> before. When we query those big rows, Thrift has been crashing. Maybe
>> major compaction kicked in for those rows (see the last log entry
>> below)? There are 30 million columns, all with small cell values, but
>> the 30 million is definitely too much.
>>
>> Here are some errors from the Hadoop log. It looks like it kept getting
>> stuck on something, which may point to the data being too big? The
>> error below occurred 12 times in a row.
>>
>> org.apache.hadoop.ipc.RemoteException: java.io.IOException:
>> blk_2176268114489978801_654636 is already commited, storedBlock == null.
>>
>> Here is the entry from the HBase log.
>>
>> 15:26:44,946 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
>> java.io.IOException: Broken pipe
>> 15:26:44,946 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
>> Aborting region server serverName=sacdnb08.dataraker.net,60020,1294089592450,
>> load=(requests=0, regions=552, usedHeap=7977, maxHeap=7987): Uncaught
>> exception in service thread regionserver60020.compactor
>> java.lang.OutOfMemoryError: Java heap space
>>
>> Thanks.
>>
>>
>> On Wed, Jan 5, 2011 at 11:45 AM, Stack <[email protected]> wrote:
>>
>>> What was the server carrying? How many regions? What kind of loading
>>> was on the cluster? We should not be OOME'ing. Do you have the heap
>>> dump lying around? (We dump heap on OOME... it's named *.hprof or
>>> something. If you have it, want to put it somewhere for me to pull it
>>> so I can take a look?) Any chance of errant big cells? Lots of them?
>>> What JVM version?
>>>
>>> St.Ack
>>>
>>> On Wed, Jan 5, 2011 at 8:10 AM, Wayne <[email protected]> wrote:
>>> > I am still struggling with the JVM. We just had a hard OOM crash of
>>> > a region server after only running for 36 hours. Any help would be
>>> > greatly appreciated. Do we need to restart nodes every 24 hours
>>> > under load? GC pauses are something we are trying to plan for, but
>>> > full-out OOM crashes are a new problem.
>>> >
>>> > The message below seems to be where it starts going bad. It is
>>> > followed by no fewer than 63 concurrent mode failure errors over a
>>> > 16-minute period.
>>> >
>>> > *GC locker: Trying a full collection because scavenge failed*
>>> >
>>> > Lastly, here is the end (after the 63 CMF errors).
>>> >
>>> > Heap
>>> >  par new generation   total 1887488K, used 303212K [0x00000005fae00000, 0x000000067ae00000, 0x000000067ae00000)
>>> >   eden space 1677824K,  18% used [0x00000005fae00000, 0x000000060d61b078, 0x0000000661480000)
>>> >   from space  209664K,   0% used [0x000000066e140000, 0x000000066e140000, 0x000000067ae00000)
>>> >   to   space  209664K,   0% used [0x0000000661480000, 0x0000000661480000, 0x000000066e140000)
>>> >  concurrent mark-sweep generation total 6291456K, used 2440155K [0x000000067ae00000, 0x00000007fae00000, 0x00000007fae00000)
>>> >  concurrent-mark-sweep perm gen total 31704K, used 18999K [0x00000007fae00000, 0x00000007fccf6000, 0x0000000800000000)
>>> >
>>> > Here again are our custom settings, in case there are suggestions
>>> > out there. Are we making it worse with these settings? What should
>>> > we try next?
>>> >
>>> > -XX:+UseCMSInitiatingOccupancyOnly
>>> > -XX:CMSInitiatingOccupancyFraction=60
>>> > -XX:+CMSParallelRemarkEnabled
>>> > -XX:SurvivorRatio=8
>>> > -XX:NewRatio=3
>>> > -XX:MaxTenuringThreshold=1
>>> >
>>> > Thanks!
>>> >
>>
>>
>
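A hedged aside on the flags quoted above, not a verified fix: if the .hprof
never shows up after one of these crashes, it is worth double-checking that
the OOME heap-dump flags are still present in HBASE_OPTS alongside the custom
CMS flags (a hand-edited HBASE_OPTS can drop the shipped defaults), and that
the dump is pointed at a disk with room. An illustrative excerpt of how that
might look in conf/hbase-env.sh, where the heap size and dump path are
placeholders:

# conf/hbase-env.sh -- illustrative only; adjust heap size and paths
export HBASE_OPTS="-Xmx8g -XX:+UseConcMarkSweepGC \
  -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:CMSInitiatingOccupancyFraction=60 \
  -XX:+CMSParallelRemarkEnabled \
  -XX:SurvivorRatio=8 \
  -XX:NewRatio=3 \
  -XX:MaxTenuringThreshold=1 \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/var/tmp/hbase-dumps"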
