Thanks for the help, Juhani :) I'll take a look with Ganglia and see what things look like.
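(In case it's useful to anyone following the thread, this is roughly how I expect to turn on Ganglia reporting, using Flume's built-in monitoring flags. The agent name, config file name, and gmond host/port below are just placeholders, not my real values:)

    # start the source-side agent with Ganglia reporting enabled
    flume-ng agent -n agent1 -c conf -f flume-source.conf \
      -Dflume.monitoring.type=ganglia \
      -Dflume.monitoring.hosts=gmond-host:8649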
Any thoughts on keeping ExecSource.batchSize, MemoryChannel.transactionCapacity, AvroSink.batch-size, and HDFSSink.batchSize the same?

I looked at the MemoryChannel code and noticed that there is a timeout parameter passed to doCommit(), where the exception is being thrown. Just for fun, I increased it from the default to 10 seconds, and now things are running smoothly with the same config as before. It's been running for about 24 hours now. A step in the right direction anyway! :)

Thanks again.
Chris

On Thu, Jan 31, 2013 at 8:12 PM, Juhani Connolly <juhani_conno...@cyberagent.co.jp> wrote:

> Hi Chris,
>
> The most likely cause of that error is that the sinks are draining
> requests more slowly than your sources are feeding in fresh data. Over
> time this fills up the capacity of your memory channel, which will then
> start refusing additional put requests.
>
> You can confirm this by connecting with JMX or Ganglia.
>
> If the write load is extremely bursty, it's possible that it is just
> temporarily exceeding the sink consumption rate, and increasing the
> channel capacity could work. Otherwise, increasing the Avro batch size
> or adding additional Avro sinks (more threads) may also help. I think
> that setting up Ganglia monitoring and looking at the incoming and
> outgoing event counts and channel fill states helps a lot in diagnosing
> these bottlenecks, so you should look into doing that.
>
> On 02/01/2013 02:01 AM, Chris Neal wrote:
>
> Hi all.
>
> I need some thoughts on sizing/tuning the above (common) route in
> Flume NG to maximize throughput. Here is my setup:
>
> *Source JVM (ExecSource/MemoryChannel/AvroSink):*
> -Xmx4g
> -Xms4g
> -XX:MaxDirectMemorySize=256m
>
> Number of ExecSources in config: 124 (yes, it's a ton; can't do
> anything about it :) The write rate to the source files is fairly fast
> and bursty.
>
> ExecSource.batchSize = 1000
> (so each of the 124 tail -F instances dumps to the memory channel once
> it has 1000 events)
>
> MemoryChannel.capacity = 1000000
> MemoryChannel.transactionCapacity = 1000
> (I'm somewhat unclear on what this is. The docs say "The number of
> events stored in the channel per transaction", but what is a
> "transaction" to a MemoryChannel?)
>
> AvroSink.batchSize = 1000
>
> *Destination JVM (AvroSource/FileChannel/HDFSSink)*
> (Cluster of two JVMs on two servers, each configured the same as below)
> -Xms2g
> -Xmx2g
> -XX:MaxDirectMemorySize is not defined, so whatever the default is
>
> AvroSource.threads = 64
> FileChannel.transactionCapacity = 1000
> FileChannel.capacity = 32000000
> HDFSSink.batchSize = 1000
> HDFSSink.threadPoolSize = 64
>
> With this configuration, in about 5 minutes I get the common exception:
>
> "Space for commit to queue couldn't be acquired Sinks are likely not
> keeping up with sources, or the buffer size is too tight"
>
> on the Source JVM. It is nowhere near the 4g max; it's only at about
> 2.5g.
>
> I'm wondering about the logic of having all the batch/transaction sizes
> set to 1000. My thought was that this would keep the data transfer from
> fragmenting, but maybe that's flawed? Should the sizes be different?
>
> I'm also curious about increasing MaxDirectMemorySize to something
> larger than 256MB. I tried removing it altogether in my Source JVM
> (which makes the size unbounded), but that didn't seem to make a
> difference.
>
> I'm having some trouble figuring out where the backup is happening, and
> how to open up the gates. :)
>
> Thanks in advance for any suggestions.
> Chris
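P.S. For anyone hitting the same exception later: if I'm reading the MemoryChannel code right, the timeout I bumped corresponds to the channel's keep-alive property, so the channel section of my source-side config now looks roughly like this (the agent and channel names here are placeholders, not my real ones):

    agent1.channels.memChannel.type = memory
    agent1.channels.memChannel.capacity = 1000000
    agent1.channels.memChannel.transactionCapacity = 1000
    # keep-alive: seconds a put/take will block waiting for channel
    # space/events before the transaction fails (default is 3)
    agent1.channels.memChannel.keep-alive = 10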