See https://issues.apache.org/jira/browse/FLUME-2307
This JIRA removed the write-timeout, but that only makes sure that there is no transaction left in limbo. The real cause, as I said, is slow I/O. Try using provisioned IOPS for better throughput.

Thanks,
Hari

On Thursday, February 27, 2014 at 10:48 AM, Mangtani, Kushal wrote:

> Hari,
>
> Thanks for the prompt reply. The current file channel's write-timeout = 30
> sec. EBS drive current capacity = 200 GB. The rate of writes is 60
> events/min, where each event is approx. 40 KB.
>
> I am thinking of increasing the file channel write-timeout to 60 sec. What
> do you suggest?
> Also, one strange thing I noticed: all the flume collectors get the same
> exception, even though each has a separate EBS drive. Any inputs?
>
> Thanks,
> Kushal Mangtani
>
> From: Hari Shreedharan [mailto:[email protected]]
> Sent: Thursday, February 27, 2014 10:35 AM
> To: [email protected]
> Subject: Re: File Channel Exception "Failed to obtain lock for writing to the
> log. Try increasing the log write timeout value"
>
> For now, increase the file channel's write-timeout parameter to around 30 or
> so (basically the file channel is timing out while writing to disk). But the
> underlying problem you are seeing is that your EBS instance is very slow and
> I/O is taking too long. You either need to increase your EBS I/O capacity, or
> reduce the rate of writes.
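For reference, the write-timeout being discussed is set per file channel in the agent's properties file. A minimal sketch, assuming a collector agent named `collector` and the channel `c2` from the stacktrace; the directory paths are illustrative, not taken from the thread:

```properties
# Hypothetical collector config -- agent name, channel name, and paths
# are illustrative.
collector.channels = c2
collector.channels.c2.type = file
collector.channels.c2.checkpointDir = /hadoop/ebs/mnt-1/checkpoint
collector.channels.c2.dataDirs = /hadoop/ebs/mnt-1/data
# Seconds to wait for the shared log write lock before the transaction
# fails with the "Failed to obtain lock" exception. This property was
# removed entirely by FLUME-2307 in later releases.
collector.channels.c2.write-timeout = 60
```

Raising the timeout only hides the symptom; as Hari notes, the lock is held longer than expected because the underlying disk writes are slow.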
> Thanks,
> Hari
>
> On Thursday, February 27, 2014 at 10:28 AM, Mangtani, Kushal wrote:
>
> > From: Mangtani, Kushal
> > Sent: Wednesday, February 26, 2014 4:51 PM
> > To: '[email protected]'; '[email protected]'
> > Cc: Rangnekar, Rohit; '[email protected]'
> > Subject: File Channel Exception "Failed to obtain lock for writing to the
> > log. Try increasing the log write timeout value"
> >
> > Hi,
> >
> > I'm using the Flume-NG 1.4 cdh4.4 tarball for collecting aggregated logs.
> > I am running a two-tier (agent, collector) Flume configuration with custom
> > plugins. There are approximately 20 agent machines (receiving data) and 6
> > collector machines (writing to HDFS), all running independently. However,
> > I have been facing some file channel exceptions on the collector side. The
> > agents appear to be working fine.
> >
> > Error stacktrace:
> >
> > org.apache.flume.ChannelException: Failed to obtain lock for writing to the log. Try increasing the log write timeout value. [channel=c2]
> >     at org.apache.flume.channel.file.FileChannel$FileBackedTransaction.doRollback(FileChannel.java:621)
> >     at org.apache.flume.channel.BasicTransactionSemantics.rollback(BasicTransactionSemantics.java:168)
> >     at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:421)
> >     at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
> >     at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
> >     …
> >
> > And I keep on getting the same error.
> >
> > P.S.: The same exception is repeated on most of the flume collector
> > machines, but not at the same time. There is usually a difference of a
> > couple of hours or more.
> > 1. HDFS sinks are written in the Amazon EC2 cloud instance.
> >
> > 2. The data dir and checkpoint dir of the file channel in all flume
> > collector instances are mounted on a separate hadoop EBS drive. This makes
> > sure that two separate collectors do not overlap their log and checkpoint
> > dirs. There is a symbolic link, i.e. /usr/lib/flume-ng/datasource →
> > /hadoop/ebs/mnt-1
> >
> > 3. Flume works fine for a couple of days, and all the agents and
> > collectors are initialized properly without exceptions.
> >
> > Questions:
> >
> > Exception "Failed to obtain lock for writing to the log. Try increasing
> > the log write timeout value. [channel=c2]": according to the
> > documentation, such an exception occurs only if two processes are
> > accessing the same file/directory. However, each channel is configured
> > separately, so no two channels should access the same dir. Hence, this
> > exception does not seem to indicate anything. Please correct me if I'm
> > wrong.
> >
> > Also, hdfs.callTimeout governs calls to HDFS for open/write operations.
> > If there is no response within that duration, it times out, and when it
> > times out it closes the file. Please correct me if I'm wrong. Also, is
> > there a way to specify the number of retries before it closes the file?
> >
> > Your inputs/suggestions will be thoroughly appreciated.
> >
> > Regards,
> > Kushal Mangtani
> > Software Engineer
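To make point 2 and the hdfs.callTimeout question concrete, the per-collector isolation and the HDFS sink timeout are separate knobs in each collector's properties file. A sketch with hypothetical agent, channel, and sink names; the close-retry property is an assumption about later Flume releases, not something available in 1.4:

```properties
# Hypothetical config for one collector host. Each host mounts its own
# EBS volume, so no two file channels share log or checkpoint directories.
collector.channels.c2.type = file
collector.channels.c2.checkpointDir = /hadoop/ebs/mnt-1/checkpoint
collector.channels.c2.dataDirs = /hadoop/ebs/mnt-1/data

# Milliseconds allowed for each HDFS call (open, write, flush, close)
# before the sink gives up on that operation.
collector.sinks.hdfsSink.hdfs.callTimeout = 30000
# Later Flume releases add hdfs.closeTries to bound close retries
# (not present in 1.4); shown here only to illustrate the question.
# collector.sinks.hdfsSink.hdfs.closeTries = 3
```

Note that the "Failed to obtain lock" message refers to the write lock on the channel's own log, contended by that channel's threads under slow I/O, so it can appear even when no two channels share a directory.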
