From: Mangtani, Kushal
Sent: Wednesday, February 26, 2014 4:51 PM
To: '[email protected]'; '[email protected]'
Cc: Rangnekar, Rohit; '[email protected]'
Subject: File Channel Exception "Failed to obtain lock for writing to the
log. Try increasing the log write timeout value"
Hi,
I'm using the Flume-NG 1.4 cdh4.4 tarball for collecting aggregated logs.
I am running a two-tier (agent, collector) Flume configuration with custom
plugins. There are approximately 20 agent machines (receiving data) and 6
collector machines (writing to HDFS), all running independently. However, I
have been facing some file channel exceptions on the collector side. The
agents appear to be working fine.
Error stacktrace:

org.apache.flume.ChannelException: Failed to obtain lock for writing to the log. Try increasing the log write timeout value. [channel=c2]
        at org.apache.flume.channel.file.FileChannel$FileBackedTransaction.doRollback(FileChannel.java:621)
        at org.apache.flume.channel.BasicTransactionSemantics.rollback(BasicTransactionSemantics.java:168)
        at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:421)
        at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
        at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
        ...
I keep getting the same error.
P.S.: The same exception is repeated on most of the Flume collector machines,
but not at the same time; occurrences are usually a couple of hours or more
apart.
1. The HDFS sinks write to HDFS running on Amazon EC2 instances.
2. The dataDirs and checkpointDir of the file channel on every collector
instance are mounted on a separate EBS volume. This ensures that no two
collectors overlap their data and checkpoint directories. There is a symbolic
link, i.e. /usr/lib/flume-ng/datasource --> /hadoop/ebs/mnt-1 (see the
configuration sketch after this list).
3. Flume works fine for a couple of days, and all agents and collectors are
initialized properly without exceptions.
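
For completeness, the file channel on each collector points at the mounted
volume through that symlink; the relevant settings look roughly like this
(the agent and channel names here are illustrative, not my exact config):

    # file channel backed by the EBS mount via the symlink above
    collector.channels.c2.type = file
    collector.channels.c2.checkpointDir = /usr/lib/flume-ng/datasource/checkpoint
    collector.channels.c2.dataDirs = /usr/lib/flume-ng/datasource/data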
Questions:
1. Regarding the exception "Failed to obtain lock for writing to the log. Try
increasing the log write timeout value. [channel=c2]": according to the
documentation, such an exception occurs only if two processes are accessing
the same file/directory. However, each channel is configured separately, so no
two channels should access the same directory. Hence, this exception does not
seem to indicate anything in my setup. Please correct me if I'm wrong.
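If I understand the docs correctly, the timeout mentioned in the message is
the file channel's write-timeout property (in seconds; the 1.4 docs list a
default of 3), so the suggested workaround would presumably be something like
the following (channel name illustrative):

    # raise the file channel lock timeout from the default of 3 seconds
    collector.channels.c2.write-timeout = 30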
2. Also, hdfs.callTimeout - my understanding is that this is the time allowed
for HDFS open/write calls: if there is no response within that duration, the
call times out, and when it times out the sink closes the file. Please correct
me if I'm wrong. Also, is there a way to specify the number of retries before
it closes the file?
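For reference, the only related knob I can find in the 1.4 docs is
hdfs.callTimeout itself (in milliseconds, default 10000); I could not find a
documented retry-count setting, so raising the timeout is all I can see to
tune (sink name illustrative):

    # allow HDFS open/write/flush/close calls up to 30 s instead of 10 s
    collector.sinks.hdfsSink.type = hdfs
    collector.sinks.hdfsSink.hdfs.callTimeout = 30000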
Your inputs/suggestions would be greatly appreciated.
Regards
Kushal Mangtani
Software Engineer