Yeah, I can script something up. Maybe increasing the timeouts was all it needed and you'll never see another reply to this thread... ;) But, if not, I'll post the thread dump next.
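Roughly what I have in mind is the sketch below. It is only a sketch: the log path, the dump directory, and the pgrep/jstack handling are guesses for my environment (we run several agent JVMs per host), so anyone copying it will need to adjust the paths and the match string for their own agents.

#!/usr/bin/env bash
# Watch a Flume agent log and capture thread dumps whenever a delivery error appears.
# The paths below are placeholders for my setup, not anything standard to Flume.
AGENT_LOG=/var/log/flume-ng/flume.log        # assumed agent log location
DUMP_DIR=/var/log/flume-ng/thread-dumps      # assumed output directory
# Brock's string from the downstream agent; on my load-balanced upstream agents
# the equivalent line is "Sink failed to consume event."
PATTERN='Unable to deliver event. Exception follows.'

mkdir -p "$DUMP_DIR"

tail -n 0 -F "$AGENT_LOG" | while read -r line; do
  case "$line" in
    *"$PATTERN"*)
      ts=$(date +%Y%m%d-%H%M%S)
      # Dump every Flume agent JVM on this host (flume-ng starts the agent
      # as org.apache.flume.node.Application).
      for pid in $(pgrep -f 'org.apache.flume.node.Application'); do
        jstack "$pid" > "$DUMP_DIR/threaddump-$ts-$pid.txt" 2>&1
      done
      ;;
  esac
done

I'd run a copy of that on both the source and destination boxes and line the dump timestamps up against the WARN/ERROR entries in the two logs.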
Thanks again for all the suggestions.
Chris

On Tue, Apr 16, 2013 at 2:57 PM, Brock Noland <[email protected]> wrote:

> Any chance of getting a thread dump (or a few) on both the source and
> destination agent during an incident? :) It'd be a little work, but a
> script could look for "Unable to deliver event. Exception follows." and
> do the thread dumps.
>
>
> On Tue, Apr 16, 2013 at 2:49 PM, Chris Neal <[email protected]> wrote:
>
>> Thanks for all the input guys. :)
>>
>> @Hari:
>> The FileChannel at the AvroSource is on a SAN disk, so I don't think it
>> is the bottleneck in my case. It is the same disk for both checkpoint and
>> data. My queue depth remains relatively stable around 2000, which doesn't
>> bother me because of my batch size.
>>
>> On the AvroSink side, I have 126 ExecSources spread across 12 JVMs at 1GB
>> heap each. Each VM has 4 AvroSinks across 2 separate servers, load
>> balanced and round robined (2 connections to each). Is that a small enough
>> number of connections to remove the thread parameter on the AvroSource?
>>
>> @Brock
>> Each of my VMs on the downstream agents are 4GB heaps. I watch the
>> MBeans pretty closely via jconsole, and they sit around 3GB used.
>>
>> Thanks again for all the help!
>>
>>
>> On Tue, Apr 16, 2013 at 2:26 PM, Brock Noland <[email protected]> wrote:
>>
>>> Another possibility is that the downstream agent is near capacity from a
>>> memory perspective. What is your heap size for these agents?
>>>
>>>
>>> On Tue, Apr 16, 2013 at 2:19 PM, Hari Shreedharan <
>>> [email protected]> wrote:
>>>
>>>> One possibility is that you are hitting the file channel disk too hard
>>>> - do you have just one disk for checkpoint and data? It might be getting
>>>> slow because of this? Also you should probably just remove the thread limit
>>>> on AvroSource. It usually does not cause too much havoc unless you have a
>>>> massive number of connections causing too many threads.
>>>>
>>>> --
>>>> Hari Shreedharan
>>>>
>>>> On Tuesday, April 16, 2013 at 12:07 PM, Chris Neal wrote:
>>>>
>>>> Thanks Hari.
>>>>
>>>> I increased both the connect and request timeouts to 40000ms, and I'm
>>>> testing that now. I am talking on a LAN though, which is part of the
>>>> reason I'm concerned. Seems like it might not actually be a network issue,
>>>> but perhaps an overloaded AvroSource on the back end?
>>>>
>>>>
>>>> On Tue, Apr 16, 2013 at 1:52 PM, Hari Shreedharan <
>>>> [email protected]> wrote:
>>>>
>>>> Looks like you are hitting Avro IPC timeouts - you should probably
>>>> increase it, especially if you are talking over WAN.
>>>>
>>>> --
>>>> Hari Shreedharan
>>>>
>>>> On Tuesday, April 16, 2013 at 11:38 AM, Chris Neal wrote:
>>>>
>>>> I'm seeing the same thing :)
>>>>
>>>> Mine is all on a local LAN though, so the fact that an RPC call doesn't
>>>> reply in 10000ms or 20000ms is quite odd. My configuration is for the most
>>>> part the same as Denis' configuration. Two tiered system, ExecSources
>>>> running tail -F on log files to an AvroSink, to an AvroSource, loading into
>>>> HDFS on the back tier.
>>>>
>>>> I, too, see this on the AvroSink
>>>>
>>>> Either (A):
>>>> [2013-04-15 23:57:14.827] [org.apache.flume.sink.LoadBalancingSinkProcessor] [ WARN] [SinkRunner-PollingRunner-LoadBalancingSinkProcessor] [] (LoadBalancingSinkProcessor.java:process:154) Sink failed to consume event. Attempting next sink if available.
>>>> org.apache.flume.EventDeliveryException: Failed to send events
>>>> at org.apache.flume.sink.AvroSink.process(AvroSink.java:324)
>>>> at org.apache.flume.sink.LoadBalancingSinkProcessor.process(LoadBalancingSinkProcessor.java:151)
>>>> at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
>>>> at java.lang.Thread.run(Thread.java:619)
>>>> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient { host: hadoopjt01.pegs.com, port: 10000 }: Failed to send batch
>>>> at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
>>>> at org.apache.flume.sink.AvroSink.process(AvroSink.java:308)
>>>> ... 3 more
>>>> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient { host: hadoopjt01.pegs.com, port: 10000 }: Handshake timed out after 20000ms
>>>> at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:280)
>>>> at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
>>>> ... 4 more
>>>> Caused by: java.util.concurrent.TimeoutException
>>>> at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:228)
>>>> at java.util.concurrent.FutureTask.get(FutureTask.java:91)
>>>> at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:278)
>>>> ... 5 more
>>>>
>>>> or (B):
>>>> [2013-04-15 19:49:01.135] [org.apache.flume.sink.LoadBalancingSinkProcessor] [ WARN] [SinkRunner-PollingRunner-LoadBalancingSinkProcessor] [] (LoadBalancingSinkProcessor.java:process:154) Sink failed to consume event. Attempting next sink if available.
>>>> org.apache.flume.EventDeliveryException: Failed to send events
>>>> at org.apache.flume.sink.AvroSink.process(AvroSink.java:324)
>>>> at org.apache.flume.sink.LoadBalancingSinkProcessor.process(LoadBalancingSinkProcessor.java:151)
>>>> at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
>>>> at java.lang.Thread.run(Thread.java:619)
>>>> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient { host: hadoopjt01.pegs.com, port: 10000 }: Failed to send batch
>>>> at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
>>>> at org.apache.flume.sink.AvroSink.process(AvroSink.java:308)
>>>> ... 3 more
>>>> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient { host: hadoopjt01.pegs.com, port: 10000 }: RPC request timed out
>>>> at org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:321)
>>>> at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:295)
>>>> at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
>>>> ... 4 more
>>>> Caused by: java.util.concurrent.TimeoutException
>>>> at org.apache.avro.ipc.CallFuture.get(CallFuture.java:132)
>>>> at org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:310)
>>>> ... 6 more
>>>>
>>>> The only thing I see on the AvroSource tier is the disconnect/reconnect
>>>> happening:
>>>>
>>>> [2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO] [pool-11-thread-10] [] (NettyServer.java:handleUpstream:171) [id: 0x2a24ed78, /138.113.127.4:34481 :> /138.113.127.72:10000] DISCONNECTED
>>>> [2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO] [pool-11-thread-10] [] (NettyServer.java:handleUpstream:171) [id: 0x2a24ed78, /138.113.127.4:34481 :> /138.113.127.72:10000] UNBOUND
>>>> [2013-04-15 19:49:00.992] [org.apache.avro.ipc.NettyServer] [ INFO] [pool-11-thread-10] [] (NettyServer.java:handleUpstream:171) [id: 0x2a24ed78, /138.113.127.4:34481 :> /138.113.127.72:10000] CLOSED
>>>> [2013-04-15 19:49:00.993] [org.apache.avro.ipc.NettyServer] [ INFO] [pool-11-thread-10] [] (NettyServer.java:channelClosed:209) Connection to /138.113.127.4:34481 disconnected.
>>>> [2013-04-15 19:49:03.331] [org.apache.avro.ipc.NettyServer] [ INFO] [pool-10-thread-1] [] (NettyServer.java:handleUpstream:171) [id: 0x3883b82e, /138.113.127.4:62442 => /138.113.127.72:10000] OPEN
>>>> [2013-04-15 19:49:03.332] [org.apache.avro.ipc.NettyServer] [ INFO] [pool-11-thread-13] [] (NettyServer.java:handleUpstream:171) [id: 0x3883b82e, /138.113.127.4:62442 => /138.113.127.72:10000] BOUND: /138.113.127.72:10000
>>>> [2013-04-15 19:49:03.333] [org.apache.avro.ipc.NettyServer] [ INFO] [pool-11-thread-13] [] (NettyServer.java:handleUpstream:171) [id: 0x3883b82e, /138.113.127.4:62442 => /138.113.127.72:10000] CONNECTED: /138.113.127.4:62442
>>>>
>>>> Is there some way to determine exactly where this bottleneck is? The
>>>> config options for AvroSource/Sink are quite short, so there is not much
>>>> tuning to do. Here is what I have:
>>>>
>>>> # avro-hadoopjt01-sink properties
>>>> udprodae01.sinks.avro-hadoopjt01-sink.type = avro
>>>> udprodae01.sinks.avro-hadoopjt01-sink.hostname = hadoopjt01.pegs.com
>>>> udprodae01.sinks.avro-hadoopjt01-sink.port = 10000
>>>> udprodae01.sinks.avro-hadoopjt01-sink.batch-size = 100
>>>>
>>>> # avro-hadoopjt01-source properties
>>>> hadoopjt01-1.sources.avro-hadoopjt01-source.type = avro
>>>> hadoopjt01-1.sources.avro-hadoopjt01-source.bind = hadoopjt01.pegs.com
>>>> hadoopjt01-1.sources.avro-hadoopjt01-source.port = 10000
>>>> hadoopjt01-1.sources.avro-hadoopjt01-source.threads = 64
>>>>
>>>> I can try increasing the AvroSink timeout values, but they seem quite
>>>> adequate at the defaults. Maybe more threads on the AvroSource?
>>>>
>>>> Any advice would be much appreciated!
>>>> Chris
>>>>
>>>>
>>>> On Wed, Feb 6, 2013 at 12:10 PM, Denis Lowe <[email protected]> wrote:
>>>>
>>>> I noticed that the logs in the destination server were reporting
>>>> dropped connections from the upstream server:
>>>>
>>>> NettyServer$NettyServerAvroHandler.handleUpstream:171) - [id: 0x08e56d76, /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] DISCONNECTED
>>>> NettyServer$NettyServerAvroHandler.handleUpstream:171) - [id: 0x08e56d76, /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] UNBOUND
>>>> NettyServer$NettyServerAvroHandler.handleUpstream:171) - [id: 0x08e56d76, /SOURCE_HOST:43599 :> /LOCAL_HOST:4003] CLOSED
>>>> NettyServer$NettyServerAvroHandler.channelClosed:209) - Connection to /SOURCE_HOST:43599 disconnected.
>>>>
>>>> The other thing I observed is that these errors only ever occur from
>>>> our EU servers (connecting to our centralized downstream collectors in US
>>>> West) - We are running Flume in Amazon EC2
>>>>
>>>> I can see in the log that the connection is restored quite quickly.
>>>>
>>>> I gather that the network latency between Europe and US West is causing
>>>> the connection between the 2 servers to 'appear' lost, thus resulting in
>>>> the above errors?
>>>>
>>>> Are there any recommended config settings to compensate for this?
>>>>
>>>> On 6 February 2013 00:21, Juhani Connolly <
>>>> [email protected]> wrote:
>>>>
>>>> Is there anything unusual in the logs for the destination (avroSource)
>>>> server?
>>>>
>>>> Since the error is happening in the AvroSink, no data is getting lost.
>>>> The failed data will get rolled back, removal from the local channel is
>>>> cancelled, and it will attempt to resend it.
>>>>
>>>>
>>>> On 02/06/2013 03:23 PM, Denis Lowe wrote:
>>>>
>>>> We are running Flume-NG 1.3.1 and have noticed periodically the
>>>> following ERROR occurring (a few times daily):
>>>>
>>>> We are using the File Channel connecting to 2 downstream collector
>>>> agents in 'round_robin' mode, using avro source/sinks.
>>>>
>>>> We are using the config described below to deliver 5 different log
>>>> types (to 5 different ports downstream) and have observed the below
>>>> error occurring randomly across all the ports.
>>>>
>>>> We tried doubling the connect-timeout to 40000 (from the default of
>>>> 20000) with no success.
>>>> The agent appears to recover and keep on processing data.
>>>>
>>>> My question is:
>>>> Has this data been lost? or will flume eventually retry until a
>>>> successful delivery has been made?
>>>> Are there any other config changes I can make to prevent/reduce
>>>> this occurring in the future?
>>>>
>>>> 05 Feb 2013 23:12:21,650 ERROR [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.SinkRunner$PollingRunner.run:160) - Unable to deliver event. Exception follows.
>>>> org.apache.flume.EventDeliveryException: Failed to send events
>>>> at org.apache.flume.sink.AvroSink.process(AvroSink.java:325)
>>>> at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
>>>> at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
>>>> at java.lang.Thread.run(Thread.java:662)
>>>> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient { host: collector1, port: 4003 }: Failed to send batch
>>>> at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:236)
>>>> at org.apache.flume.sink.AvroSink.process(AvroSink.java:309)
>>>> ... 3 more
>>>> Caused by: org.apache.flume.EventDeliveryException: NettyAvroRpcClient { host: collector2, port: 4003 }: RPC request timed out
>>>> at org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:321)
>>>> at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:295)
>>>> at org.apache.flume.api.NettyAvroRpcClient.appendBatch(NettyAvroRpcClient.java:224)
>>>> ... 4 more
>>>> Caused by: java.util.concurrent.TimeoutException
>>>> at org.apache.avro.ipc.CallFuture.get(CallFuture.java:132)
>>>> at org.apache.flume.api.NettyAvroRpcClient.waitForStatusOK(NettyAvroRpcClient.java:310)
>>>> ... 6 more
>>>>
>>>> Below is a snapshot of the current config:
>>>>
>>>> agent.sources.eventdata.command = tail -qn +0 -F /var/log/event-logs/live/eventdata.log
>>>> agent.sources.eventdata.channels = eventdata-avro-channel
>>>> agent.sources.eventdata.batchSize = 10
>>>> agent.sources.eventdata.restart = true
>>>>
>>>> ## event source interceptor
>>>> agent.sources.eventdata.interceptors.host-interceptor.type = org.apache.flume.interceptor.HostInterceptor$Builder
>>>> agent.sources.eventdata.interceptors.host-interceptor.hostHeader = source-host
>>>> agent.sources.eventdata.interceptors.host-interceptor.useIP = false
>>>>
>>>> ## eventdata channel
>>>> agent.channels.eventdata-avro-channel.type = file
>>>> agent.channels.eventdata-avro-channel.checkpointDir = /mnt/flume-ng/checkpoint/eventdata-avro-channel
>>>> agent.channels.eventdata-avro-channel.dataDirs = /mnt/flume-ng/data1/eventdata-avro-channel,/mnt/flume-ng/data2/eventdata-avro-channel
>>>> agent.channels.eventdata-avro-channel.maxFileSize = 210000000
>>>> agent.channels.eventdata-avro-channel.capacity = 2000000
>>>> agent.channels.eventdata-avro-channel.checkpointInterval = 300000
>>>> agent.channels.eventdata-avro-channel.transactionCapacity = 10000
>>>> agent.channels.eventdata-avro-channel.keep-alive = 20
>>>> agent.channels.eventdata-avro-channel.write-timeout = 20
>>>>
>>>> ## 2 x downstream click sinks for load balancing and failover
>>>> agent.sinks.eventdata-avro-sink-1.type = avro
>>>> agent.sinks.eventdata-avro-sink-1.channel = eventdata-avro-channel
>>>> agent.sinks.eventdata-avro-sink-1.hostname = collector1
>>>> agent.sinks.eventdata-avro-sink-1.port = 4003
>>>> agent.sinks.eventdata-avro-sink-1.batch-size = 100
>>>> agent.sinks.eventdata-avro-sink-1.connect-timeout = 40000
>>>>
>>>> agent.sinks.eventdata-avro-sink-2.type = avro
>>>> agent.sinks.eventdata-avro-sink-2.channel = eventdata-avro-channel
>>>> agent.sinks.eventdata-avro-sink-2.hostname = collector2
>>>> agent.sinks.eventdata-avro-sink-2.port = 4003
>>>> agent.sinks.eventdata-avro-sink-2.batch-size = 100
>>>> agent.sinks.eventdata-avro-sink-2.connect-timeout = 40000
>>>>
>>>> ## load balance config
>>>> agent.sinkgroups = eventdata-avro-sink-group
>>>> agent.sinkgroups.eventdata-avro-sink-group.sinks = eventdata-avro-sink-1 eventdata-avro-sink-2
>>>> agent.sinkgroups.eventdata-avro-sink-group.processor.type = load_balance
>>>> agent.sinkgroups.eventdata-avro-sink-group.processor.selector = round_robin
>>>>
>>>> Denis.
>>>>
>>>
>>> --
>>> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>>>
>>
>
> --
> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>
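P.S. For anyone who finds this thread in the archives: the timeout change I mentioned testing above is nothing more than the two Avro sink timeout properties below, set on each upstream sink (shown here for one of my sinks; both default to 20000 ms). I'm not claiming yet that it fixes anything, it's just the first thing I'm trying.

# raise both Avro sink timeouts from their 20000 ms defaults (values in ms)
udprodae01.sinks.avro-hadoopjt01-sink.connect-timeout = 40000
udprodae01.sinks.avro-hadoopjt01-sink.request-timeout = 40000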
