Hello all,
My setup:
- Flume 1.4
- CDH 4.2.2 (2.0.0-cdh4.2.2)
I am testing a simple Flume setup with a Sequence Generator Source, a File
Channel, and an HDFS Sink (see my flume.conf below). This configuration works
as expected until I reboot the cluster’s NameNode or restart the HDFS service
on the cluster. At that point the Flume Agent appears unable to reconnect to
HDFS and must be restarted manually. Since NameNode restarts are not uncommon
in our production cluster, it is important that Flume be able to reconnect
gracefully without any manual intervention.
So, how do we fix this HDFS reconnection issue?
Here is our flume.conf:
appserver.sources = rawtext
appserver.channels = testchannel
appserver.sinks = test_sink
appserver.sources.rawtext.type = seq
appserver.sources.rawtext.channels = testchannel
appserver.channels.testchannel.type = file
appserver.channels.testchannel.capacity = 10000000
appserver.channels.testchannel.minimumRequiredSpace = 214748364800
appserver.channels.testchannel.checkpointDir = /Users/aoneill/Desktop/testchannel/checkpoint
appserver.channels.testchannel.dataDirs = /Users/aoneill/Desktop/testchannel/data
appserver.channels.testchannel.maxFileSize = 20000000
appserver.sinks.test_sink.type = hdfs
appserver.sinks.test_sink.channel = testchannel
appserver.sinks.test_sink.hdfs.path = hdfs://cluster01:8020/user/aoneill/flumetest
appserver.sinks.test_sink.hdfs.closeTries = 3
appserver.sinks.test_sink.hdfs.filePrefix = events-
appserver.sinks.test_sink.hdfs.fileSuffix = .avro
appserver.sinks.test_sink.hdfs.fileType = DataStream
appserver.sinks.test_sink.hdfs.writeFormat = Text
appserver.sinks.test_sink.hdfs.inUsePrefix = inuse-
appserver.sinks.test_sink.hdfs.inUseSuffix = .avro
appserver.sinks.test_sink.hdfs.rollCount = 100000
appserver.sinks.test_sink.hdfs.rollInterval = 30
appserver.sinks.test_sink.hdfs.rollSize = 10485760
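For what it's worth, the only mitigation I have thought of so far (untested,
and I am not sure it addresses the reconnection problem itself) is to make the
sink close files more aggressively, so that fewer pre-restart handles get
reused:

# Untested additions under consideration, on top of the config above.
# Close any file that has seen no activity for 60 seconds, forcing the
# sink to open a fresh handle afterwards:
appserver.sinks.test_sink.hdfs.idleTimeout = 60
# Fail HDFS open/write/flush/close calls after 30 seconds instead of
# letting them hang (the default is 10000 ms):
appserver.sinks.test_sink.hdfs.callTimeout = 30000

Even if those settings mask the symptom, I would still like to understand why
the sink never recovers on its own.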
These are the two error messages that the Flume Agent logs repeatedly after
the restart:
2014-08-26 10:47:24,572 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.sink.hdfs.AbstractHDFSWriter.isUnderReplicated(AbstractHDFSWriter.java:96)] Unexpected error while checking replication factor
java.lang.reflect.InvocationTargetException
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.flume.sink.hdfs.AbstractHDFSWriter.getNumCurrentReplicas(AbstractHDFSWriter.java:162)
    at org.apache.flume.sink.hdfs.AbstractHDFSWriter.isUnderReplicated(AbstractHDFSWriter.java:82)
    at org.apache.flume.sink.hdfs.BucketWriter.shouldRotate(BucketWriter.java:452)
    at org.apache.flume.sink.hdfs.BucketWriter.append(BucketWriter.java:387)
    at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:392)
    at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
    at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
    at java.lang.Thread.run(Thread.java:744)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:525)
    at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1253)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:891)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:881)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:982)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:779)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)
and
2014-08-26 10:47:29,592 (SinkRunner-PollingRunner-DefaultSinkProcessor) [WARN - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:418)] HDFS IO error
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:207)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:525)
    at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1253)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.transfer(DFSOutputStream.java:891)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:881)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:982)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:779)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)
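If I am reading the first trace correctly, the failure occurs while the sink
checks the replication factor of a file that was already open before the
restart, so it looks as though the BucketWriter keeps reusing its dead
DFSOutputStream instead of opening a new one. In case it helps with diagnosis:
starting the agent with Flume's built-in JSON monitoring enabled (hypothetical
invocation below; the port is arbitrary) and polling
http://<agent-host>:34545/metrics should show the sink's EventDrainSuccessCount
stop advancing after the restart, consistent with the errors above.

# Hypothetical invocation: expose the agent's metrics as JSON over HTTP
# so the stuck sink is visible without grepping logs
flume-ng agent --conf conf --conf-file flume.conf --name appserver \
  -Dflume.monitoring.type=http -Dflume.monitoring.port=34545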
I can provide additional information if needed. Thank you very much for any
insight you can provide into this problem.
Best,
Andrew