OK. Thanks for the stack trace. It's unrelated to compactions, but that doesn't mean compaction isn't responsible.
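For context on why one wide row can take down a region server: if a full row's worth of KeyValues ends up materialized on the heap at once (during compaction or RPC deserialization), the cost scales with column count. A back-of-envelope sketch — the 48-byte per-KV overhead is an assumed figure for JVM object/array headers, not HBase's exact accounting:

```java
// Back-of-envelope heap estimate for buffering one wide row as KeyValues.
// The 48-byte per-KV overhead is an assumption (object + byte[] headers),
// not HBase's exact internal accounting.
public class WideRowEstimate {

    static long kvBytes(int rowKeyLen, int qualifierLen, int valueLen) {
        int keyMeta = 8 + 1;   // timestamp (long) + key-type byte
        int jvmOverhead = 48;  // assumed per-object JVM overhead
        return (long) rowKeyLen + qualifierLen + keyMeta + valueLen + jvmOverhead;
    }

    static long rowBytes(long columns, int rowKeyLen, int qualifierLen, int valueLen) {
        return columns * kvBytes(rowKeyLen, qualifierLen, valueLen);
    }

    public static void main(String[] args) {
        // 30 million columns with small (here, 8-byte) qualifiers and values,
        // as described in the thread.
        long bytes = rowBytes(30_000_000L, 16, 8, 8);
        System.out.printf("~%.1f GB to hold one such row in memory%n", bytes / 1e9);
        // prints: ~2.7 GB to hold one such row in memory
    }
}
```

Even with tiny cells, ~2.7 GB for a single row against the ~8 GB heap reported below (maxHeap=7987) leaves little room for memstores and block cache, which is consistent with the "taller" schema fix Wayne describes.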
I'm filing an issue for wide rows causing OOME (HBASE-3421). Maybe it's not the case here, but we should check it out at least, because I know of others who have schemas where the row width can grow unbounded.

Thanks,
St.Ack

On Wed, Jan 5, 2011 at 1:08 PM, Wayne <[email protected]> wrote:
> Assuming it was big rows, I dropped the table and have started loading it
> again (I could not share the data anyway). We will see it again eventually
> if it is not the large rows. We have changed our data model to be "taller",
> so the 30-million-column row is now gone for good.
>
> Below are the contents of the .out file from the first OOM. I found the hprof
> files, but I have no public place to put them. At 10GB I cannot even get
> them to my laptop to look at either.
>
> Thanks.
>
> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor2]
> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor3]
> [Unloading class sun.reflect.GeneratedSerializationConstructorAccessor1]
> [Unloading class sun.reflect.GeneratedConstructorAccessor10]
> [Unloading class sun.reflect.GeneratedConstructorAccessor21]
> [Unloading class sun.reflect.GeneratedConstructorAccessor23]
> [Unloading class sun.reflect.GeneratedMethodAccessor17]
> [Unloading class sun.reflect.GeneratedMethodAccessor1]
> [Unloading class sun.reflect.GeneratedMethodAccessor19]
> [Unloading class sun.reflect.GeneratedMethodAccessor18]
> [Unloading class sun.reflect.GeneratedMethodAccessor16]
> java.lang.OutOfMemoryError: Java heap space
> Dumping heap to java_pid17874.hprof ...
> Exception in thread "pool-1-thread-6" java.lang.OutOfMemoryError: Java heap space
>     at org.apache.hadoop.hbase.client.Put.readFields(Put.java:500)
> Heap dump file created [9689854890 bytes in 151.877 secs]
>     at org.apache.hadoop.hbase.client.MultiPut.readFields(MultiPut.java:111)
>     at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:461)
>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invocation.readFields(HBaseRPC.java:124)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:959)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:927)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:503)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:297)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.lang.Thread.run(Unknown Source)
> Exception in thread "ResponseProcessor for block blk_2176268114489978801_654636" java.lang.OutOfMemoryError: Java heap space
>     at java.util.HashMap.newKeyIterator(Unknown Source)
>     at java.util.HashMap$KeySet.iterator(Unknown Source)
>     at java.util.HashSet.iterator(Unknown Source)
>     at sun.nio.ch.SelectorImpl.processDeregisterQueue(Unknown Source)
>     at sun.nio.ch.EPollSelectorImpl.doSelect(Unknown Source)
>     at sun.nio.ch.SelectorImpl.lockAndDoSelect(Unknown Source)
>     at sun.nio.ch.SelectorImpl.select(Unknown Source)
>     at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>     at java.io.DataInputStream.readFully(Unknown Source)
>     at java.io.DataInputStream.readLong(Unknown Source)
>     at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:122)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2547)
> Exception in thread "pool-1-thread-7" java.lang.OutOfMemoryError: Java heap space
>     at org.apache.hadoop.hbase.client.Put.readFields(Put.java:495)
>     at org.apache.hadoop.hbase.client.MultiPut.readFields(MultiPut.java:111)
>     at org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:461)
>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invocation.readFields(HBaseRPC.java:124)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:959)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:927)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:503)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:297)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.lang.Thread.run(Unknown Source)
> Exception in thread "pool-1-thread-9" java.lang.OutOfMemoryError: Java heap space
>
>
> On Wed, Jan 5, 2011 at 3:36 PM, Stack <[email protected]> wrote:
>
>> That could be it. Do you know the names of the participating files? They
>> should be printed in the log. Can you save aside that Store, tar it
>> up, and put it somewhere we can pull it from (if the data is OK for others to see)?
>> If we are OOME'ing on big rows, that's something I know some fellas
>> would for sure be interested in fixing.
>>
>> (There is actually a 'limit' on the number of kvs to get at a time when
>> scanning that is unused by the compaction code -- perhaps that's all it would
>> take to fix this issue?)
>>
>> It's highly unlikely, but I'll ask anyway: was the OOME in the same
>> location both times?
>> (The OOME stack trace may be in the .out file rather than the .log file.)
>>
>> The location of the hprof is usually where the program was launched from
>> (check the $HBASE_HOME dir).
>>
>> St.Ack
>>
>>
>> On Wed, Jan 5, 2011 at 11:24 AM, Wayne <[email protected]> wrote:
>> > Pretty sure this is compaction. The same node OOMEd again, along with
>> > another node, after starting compaction. Like cass* .6, I guess HBase cannot
>> > handle a row bigger than it can hold in memory. I always read a lot about big
>> > cells being a problem, but this problem is big rows.
>> >
>> > Thanks.
>> >
>> > On Wed, Jan 5, 2011 at 12:13 PM, Wayne <[email protected]> wrote:
>> >
>> >> It was carrying ~9k writes/sec and has been for the last 24+ hours. There
>> >> are 500+ regions on that node. I could not find the heap dump (location?),
>> >> but we do have some errant big rows that have crashed before. When we query
>> >> those big rows, Thrift has been crashing. Maybe major compaction kicked in
>> >> for those rows (see the last log entry below)? There are 30 million columns with
>> >> all small cell values, but the 30 million is definitely too much.
>> >>
>> >> Here are some errors from the hadoop log. It looks like it kept getting
>> >> stuck on something, which may point to the data being too big. The error
>> >> below occurred 12 times in a row.
>> >>
>> >> org.apache.hadoop.ipc.RemoteException: java.io.IOException:
>> >> blk_2176268114489978801_654636 is already commited, storedBlock == null.
>> >>
>> >> Here is the entry from the HBase log.
>> >>
>> >> 15:26:44,946 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
>> >> java.io.IOException: Broken pipe
>> >> 15:26:44,946 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
>> >> Aborting region server serverName=sacdnb08.dataraker.net,60020,1294089592450,
>> >> load=(requests=0, regions=552, usedHeap=7977, maxHeap=7987): Uncaught
>> >> exception in service thread regionserver60020.compactor
>> >> java.lang.OutOfMemoryError: Java heap space
>> >>
>> >> Thanks.
>> >>
>> >>
>> >> On Wed, Jan 5, 2011 at 11:45 AM, Stack <[email protected]> wrote:
>> >>
>> >>> What was the server carrying? How many regions? What kind of
>> >>> loading was on the cluster? We should not be OOME'ing. Do you have
>> >>> the heap dump lying around? (We dump heap on OOME... it's named *.hprof
>> >>> or something. If you have it, want to put it somewhere for me to pull
>> >>> it so I can take a look?) Any chance of errant big cells? Lots of
>> >>> them? What JVM version?
>> >>>
>> >>> St.Ack
>> >>>
>> >>> On Wed, Jan 5, 2011 at 8:10 AM, Wayne <[email protected]> wrote:
>> >>> > I am still struggling with the JVM. We just had a hard OOM crash of a
>> >>> > region server after only running for 36 hours. Any help would be greatly
>> >>> > appreciated. Do we need to restart nodes every 24 hours under load? GC
>> >>> > pauses are something we are trying to plan for, but full-out OOM crashes
>> >>> > are a new problem.
>> >>> >
>> >>> > The message below seems to be where it starts going bad. It is followed
>> >>> > by no less than 63 Concurrent Mode Failure errors over a 16-minute period.
>> >>> >
>> >>> > *GC locker: Trying a full collection because scavenge failed*
>> >>> >
>> >>> > Lastly, here is the end (after the 63 CMF errors).
>> >>> >
>> >>> > Heap
>> >>> >  par new generation   total 1887488K, used 303212K [0x00000005fae00000,
>> >>> > 0x000000067ae00000, 0x000000067ae00000)
>> >>> >   eden space 1677824K,  18% used [0x00000005fae00000, 0x000000060d61b078,
>> >>> > 0x0000000661480000)
>> >>> >   from space 209664K,   0% used [0x000000066e140000, 0x000000066e140000,
>> >>> > 0x000000067ae00000)
>> >>> >   to   space 209664K,   0% used [0x0000000661480000, 0x0000000661480000,
>> >>> > 0x000000066e140000)
>> >>> >  concurrent mark-sweep generation total 6291456K, used 2440155K
>> >>> > [0x000000067ae00000, 0x00000007fae00000, 0x00000007fae00000)
>> >>> >  concurrent-mark-sweep perm gen total 31704K, used 18999K
>> >>> > [0x00000007fae00000, 0x00000007fccf6000, 0x0000000800000000)
>> >>> >
>> >>> > Here again are our custom settings, in case there are some suggestions
>> >>> > out there. Are we making it worse with these settings? What should we
>> >>> > try next?
>> >>> >
>> >>> > -XX:+UseCMSInitiatingOccupancyOnly
>> >>> > -XX:CMSInitiatingOccupancyFraction=60
>> >>> > -XX:+CMSParallelRemarkEnabled
>> >>> > -XX:SurvivorRatio=8
>> >>> > -XX:NewRatio=3
>> >>> > -XX:MaxTenuringThreshold=1
>> >>> >
>> >>> >
>> >>> > Thanks!
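For reference, flags like the ones quoted above normally go into `HBASE_OPTS` in `hbase-env.sh`. A sketch, with assumptions labeled: the heap size and CMS collector flag are inferred from the logs in this thread (maxHeap=7987 ≈ 8 GB, a concurrent mark-sweep generation in the heap printout), and the two heap-dump flags are standard HotSpot options added here so the next OOME produces an hprof in a known location:

```shell
# hbase-env.sh (sketch, not this cluster's actual file):
# the thread's CMS tuning flags, plus standard HotSpot options to
# capture a heap dump in a predictable place on the next OOME.
# -Xmx8g and -XX:+UseConcMarkSweepGC are inferred from the logs above.
export HBASE_OPTS="$HBASE_OPTS \
  -Xmx8g \
  -XX:+UseConcMarkSweepGC \
  -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:CMSInitiatingOccupancyFraction=60 \
  -XX:+CMSParallelRemarkEnabled \
  -XX:SurvivorRatio=8 \
  -XX:NewRatio=3 \
  -XX:MaxTenuringThreshold=1 \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/var/log/hbase"
```

`-XX:HeapDumpPath` avoids the "where is the hprof?" hunt earlier in the thread: without it, the dump lands in the JVM's working directory (usually wherever HBase was launched from).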
