Try turning on -XX:+HeapDumpOnOutOfMemoryError so we can peek at the heap dump.
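
For example, you can add the flag to the JAVA_OPTS you already pass. This is a sketch only: -XX:HeapDumpPath is optional, and the dump location below simply reuses the directory from your existing gc.log setting.

    JAVA_OPTS="-Xms64m -Xmx256m -Xss128k -XX:MaxDirectMemorySize=256m \
      -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/mnt/logs/flume-ng/ \
      -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails \
      -XX:+PrintGCTimeStamps -verbose:gc -Xloggc:/mnt/logs/flume-ng/gc.log"

When HeapDumpPath points at a directory, the JVM writes java_pid<pid>.hprof there on the next OOM; the file can be opened with VisualVM or jhat. One caveat: the OOM below is thrown from ZlibCompressor.init, a native allocation, so the heap dump may come back clean. That in itself would tell us the leak is in native memory rather than the Java heap.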
--
Brock Noland
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Friday, March 1, 2013 at 5:57 PM, Denis Lowe wrote:

> process failed - java.lang.OutOfMemoryError
>
> We observed the following error:
> 01 Mar 2013 21:37:24,807 ERROR [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.HDFSEventSink.process:460) - process failed
> java.lang.OutOfMemoryError
>         at org.apache.hadoop.io.compress.zlib.ZlibCompressor.init(Native Method)
>         at org.apache.hadoop.io.compress.zlib.ZlibCompressor.<init>(ZlibCompressor.java:222)
>         at org.apache.hadoop.io.compress.GzipCodec$GzipZlibCompressor.<init>(GzipCodec.java:159)
>         at org.apache.hadoop.io.compress.GzipCodec.createCompressor(GzipCodec.java:109)
>         at org.apache.hadoop.io.compress.GzipCodec.createOutputStream(GzipCodec.java:92)
>         at org.apache.flume.sink.hdfs.HDFSCompressedDataStream.open(HDFSCompressedDataStream.java:70)
>         at org.apache.flume.sink.hdfs.BucketWriter.doOpen(BucketWriter.java:216)
>         at org.apache.flume.sink.hdfs.BucketWriter.access$000(BucketWriter.java:53)
>         at org.apache.flume.sink.hdfs.BucketWriter$1.run(BucketWriter.java:172)
>         at org.apache.flume.sink.hdfs.BucketWriter$1.run(BucketWriter.java:170)
>         at org.apache.flume.sink.hdfs.BucketWriter.runPrivileged(BucketWriter.java:143)
>         at org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:170)
>         at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:364)
>         at org.apache.flume.sink.hdfs.HDFSEventSink$2.call(HDFSEventSink.java:729)
>         at org.apache.flume.sink.hdfs.HDFSEventSink$2.call(HDFSEventSink.java:727)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:722)
>
> Unfortunately the error does not state whether it is caused by a lack of heap, perm, or direct memory.
>
> Looking at the system memory we could see that we were using 3GB of 7GB (i.e. less than half of the physical memory was used).
>
> Using the VisualVM profiler we could see that we had not maxed out the heap: 75MB of 131MB allocated. PermGen was fine too: 16MB of 27MB allocated.
>
> Buffer usage is as follows:
> Direct memory: < 50MB (this gets freed after each GC)
> Mapped memory: count 9, 144MB (always stays constant)
>
> I'm assuming -XX:MaxDirectMemorySize applies to direct buffer memory, NOT mapped buffer memory?
>
> The other thing we noticed was that after a restart the Flume process "RES" size starts at around 200MB and then, over a period of a week, grows to 3GB, after which we observed the above error.
> Unfortunately we cannot see where this 3GB of memory is being used when profiling with VisualVM and JConsole (max heap size is set to 256MB) - there definitely appears to be a slow memory leak.
>
> Flume is the only process running on this server:
> 64-bit CentOS
> java version "1.6.0_27" (64-bit)
>
> The Flume collector is configured with 8 file channels writing to S3 using the HDFS sink.
> (8 upstream servers are pushing events to 2 downstream collectors.)
>
> Each of the 8 channels/sinks is configured as follows:
>
> ## impression source
> agent.sources.impressions.type = avro
> agent.sources.impressions.bind = 0.0.0.0
> agent.sources.impressions.port = 5001
> agent.sources.impressions.channels = impressions-s3-channel
> ## impression channel
> agent.channels.impressions-s3-channel.type = file
> agent.channels.impressions-s3-channel.checkpointDir = /mnt/flume-ng/checkpoint/impressions-s3-channel
> agent.channels.impressions-s3-channel.dataDirs = /mnt/flume-ng/data1/impressions-s3-channel,/mnt/flume-ng/data2/impressions-s3-channel
> agent.channels.impressions-s3-channel.maxFileSize = 210000000
> agent.channels.impressions-s3-channel.capacity = 2000000
> agent.channels.impressions-s3-channel.checkpointInterval = 300000
> agent.channels.impressions-s3-channel.transactionCapacity = 10000
> ## impression s3 sink
> agent.sinks.impressions-s3-sink.type = hdfs
> agent.sinks.impressions-s3-sink.channel = impressions-s3-channel
> agent.sinks.impressions-s3-sink.hdfs.path = s3n://KEY:SECRET_KEY@S3-PATH
> agent.sinks.impressions-s3-sink.hdfs.filePrefix = impressions-%{collector-host}
> agent.sinks.impressions-s3-sink.hdfs.callTimeout = 0
> agent.sinks.impressions-s3-sink.hdfs.rollInterval = 3600
> agent.sinks.impressions-s3-sink.hdfs.rollSize = 450000000
> agent.sinks.impressions-s3-sink.hdfs.rollCount = 0
> agent.sinks.impressions-s3-sink.hdfs.codeC = gzip
> agent.sinks.impressions-s3-sink.hdfs.fileType = CompressedStream
> agent.sinks.impressions-s3-sink.hdfs.batchSize = 100
>
> I am using flume-ng 1.3.1 with the following parameters:
> JAVA_OPTS="-Xms64m -Xmx256m -Xss128k -XX:MaxDirectMemorySize=256m \
>   -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails \
>   -XX:+PrintGCTimeStamps -verbose:gc -Xloggc:/mnt/logs/flume-ng/gc.log"
>
> We have 2 collectors running and they both fail at pretty much the same time.
>
> So from what I can see there appears to be a slow memory leak in the HDFS sink, but I have no idea how to track this down or what alternative configuration I can use to prevent it from happening again.
>
> Any ideas would be greatly appreciated!
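
If the heap dump does come back clean, it is worth watching the native side of the process directly. Below is a minimal sketch, not a polished tool: it assumes Linux, that jps lists the agent under its main class name "Application" (adjust the match if yours differs), and that an hourly sample is frequent enough to see a week-long leak.

    # Log the Flume agent's resident set size once an hour so the leak's slope is visible.
    PID=$(jps | awk '/Application/ {print $1}')
    while sleep 3600; do
      echo "$(date) RSS(kB)=$(ps -o rss= -p "$PID")" >> /mnt/logs/flume-ng/rss.log
      pmap -x "$PID" | tail -n 1 >> /mnt/logs/flume-ng/rss.log   # total mapped/RSS summary line
    done

Diffing successive pmap runs should show whether the growth sits in anonymous native mappings (for example zlib deflate buffers allocated by the gzip codec, which is where your stack trace points) rather than in the Java heap.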
