Hi All,

(This is all about CDH3, so I am not sure whether it should go on this list, but I figure it is at least interesting for people trying the same.)
I've recently tried CDH3 on a new cluster from RPMs with the hadoop-lzo fork from https://github.com/toddlipcon/hadoop-lzo. Everything works like a charm initially, but after some time (anywhere from a few minutes to an hour at most), the memory of the region server (RS) JVM process grows to more than twice the configured heap size and keeps growing. I have seen an RS with a 16GB heap grow to 55GB virtual size. At some point everything starts swapping, GC times go into the minutes, and everything dies or is considered dead by the master.

I did a pmap -x on the RS process, and that shows a lot of allocated blocks of about 64M each. There are about 500 of these, which is about 32GB in total. See: http://pastebin.com/8pgzPf7b (the 64M blocks are at the bottom of the file; the blocks of about 1M at the top are probably thread stacks). Unfortunately, Linux shows the native heap as anon blocks, so I cannot link it to a specific library.

I am running the latest CDH3 and hadoop-lzo 0.4.6 (from said URL, the one which has the reinit() support). I run Java 6u21 with the G1 garbage collector, which has been running fine for some weeks now. Full command line is:

    java -Xmx16000m -XX:+HeapDumpOnOutOfMemoryError -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC -XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/export/logs/hbase/gc-hbase.log -Djava.library.path=/home/inr/java-lib/hbase/native/Linux-amd64-64 -Djava.net.preferIPv4Stack=true -Dhbase.log.dir=/export/logs/hbase -Dhbase.log.file=hbase-hbase-regionserver-w3r1.inrdb.ripe.net.log -Dhbase.home.dir=/usr/lib/hbase/bin/.. -Dhbase.id.str=hbase -Dhbase.r

I searched the HBase source for something that could point to native heap usage (like ByteBuffer#allocateDirect(...)), but I could not find anything. Thread count is about 185 (I have 100 handlers), so nothing strange there either.

The question is: could this be HBase, or is this a problem with hadoop-lzo?

I have currently downgraded to a version known to work, because we have a demo coming up.
I am still interested in the answer, though.

Regards,
Friso
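
P.S. In case anyone wants to repeat the pmap check on their own region server, here is a minimal sketch in Python of how the ~64M anon blocks could be counted. The function name summarize_anon_blocks, the 65536 KB target and the 4096 KB tolerance are illustrative assumptions, not anything from HBase or hadoop-lzo, and the parser assumes the usual pmap -x column layout (Address, Kbytes, RSS, Dirty, Mode, Mapping).

```python
import re

# One mapping row of `pmap -x`: address, Kbytes, RSS, Dirty, mode, mapping name.
# Header, separator and "total" rows do not start with a hex address followed
# by whitespace, so they are skipped automatically.
ROW = re.compile(r"^([0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(.*)$")


def summarize_anon_blocks(pmap_text, target_kb=65536, tolerance_kb=4096):
    """Count anonymous mappings whose size is close to target_kb.

    Returns (block_count, total_kbytes, total_rss_kbytes).
    target_kb and tolerance_kb are hypothetical defaults for "about 64M".
    """
    count = 0
    total_kb = 0
    total_rss_kb = 0
    for line in pmap_text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue
        kbytes = int(match.group(2))
        rss = int(match.group(3))
        mapping = match.group(6)
        # Linux reports native heap allocations as "[ anon ]" mappings.
        if "anon" in mapping and abs(kbytes - target_kb) <= tolerance_kb:
            count += 1
            total_kb += kbytes
            total_rss_kb += rss
    return count, total_kb, total_rss_kb
```

Saving the output of pmap -x for the RS pid to a file and passing the file contents to this function should give the number of such blocks and their combined size. With about 500 blocks of 64M each, the total comes to roughly 32GB, which is the figure quoted above.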
