Thanks for the help, Juhani :) I'll take a look with Ganglia and see what things look like.
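(In case it's useful to anyone following the thread, this is roughly how I expect to turn on Ganglia reporting, using Flume's built-in monitoring flags. The agent name, config file name, and gmond host/port below are just placeholders, not my real values:)

    # start the source-side agent with Ganglia reporting enabled
    flume-ng agent -n agent1 -c conf -f flume-source.conf \
      -Dflume.monitoring.type=ganglia \
      -Dflume.monitoring.hosts=gmond-host:8649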
Any thoughts on keeping ExecSource.batchSize, MemoryChannel.transactionCapacity, AvroSink.batch-size, and HDFSSink.batchSize the same?

I looked at the MemoryChannel code and noticed that there is a timeout parameter passed to doCommit(), where the exception is being thrown. Just for fun, I increased it from the default to 10 seconds, and now things are running smoothly with the same config as before. It's been running for about 24 hours now. A step in the right direction anyway! :)

Thanks again.
Chris

On Thu, Jan 31, 2013 at 8:12 PM, Juhani Connolly <juhani_conno...@cyberagent.co.jp> wrote:

> Hi Chris,
>
> The most likely cause of that error is that the sinks are draining
> requests more slowly than your sources are feeding in fresh data. Over
> time this fills up the capacity of your memory channel, which will then
> start refusing additional put requests.
>
> You can confirm this by connecting with JMX or Ganglia.
>
> If the write load is extremely bursty, it's possible that it is just
> temporarily exceeding the sink consumption rate, and increasing the
> channel capacity could work. Otherwise, increasing the Avro batch size
> or adding additional Avro sinks (more threads) may also help. I think
> that setting up Ganglia monitoring and looking at the incoming and
> outgoing event counts and channel fill states helps a lot in diagnosing
> these bottlenecks, so you should look into doing that.
>
> On 02/01/2013 02:01 AM, Chris Neal wrote:
>
> Hi all.
>
> I need some thoughts on sizing/tuning the above (common) route in
> Flume NG to maximize throughput. Here is my setup:
>
> *Source JVM (ExecSource/MemoryChannel/AvroSink):*
> -Xmx4g
> -Xms4g
> -XX:MaxDirectMemorySize=256m
>
> Number of ExecSources in config: 124 (yes, it's a ton; can't do
> anything about it :) The write rate to the source files is fairly fast
> and bursty.
>
> ExecSource.batchSize = 1000
> (so each of the 124 tail -F instances dumps to the memory channel once
> it has 1000 events)
>
> MemoryChannel.capacity = 1000000
> MemoryChannel.transactionCapacity = 1000
> (I'm somewhat unclear on what this is. The docs say "The number of
> events stored in the channel per transaction", but what is a
> "transaction" to a MemoryChannel?)
>
> AvroSink.batchSize = 1000
>
> *Destination JVM (AvroSource/FileChannel/HDFSSink)*
> (Cluster of two JVMs on two servers, each configured the same as below)
> -Xms2g
> -Xmx2g
> -XX:MaxDirectMemorySize is not defined, so whatever the default is
>
> AvroSource.threads = 64
> FileChannel.transactionCapacity = 1000
> FileChannel.capacity = 32000000
> HDFSSink.batchSize = 1000
> HDFSSink.threadPoolSize = 64
>
> With this configuration, in about 5 minutes I get the common exception:
>
> "Space for commit to queue couldn't be acquired Sinks are likely not
> keeping up with sources, or the buffer size is too tight"
>
> on the Source JVM. It is nowhere near the 4g max; it's only at about
> 2.5g.
>
> I'm wondering about the logic of having all the batch/transaction sizes
> set to 1000. My thought was that this would keep the data transfer from
> fragmenting, but maybe that's flawed? Should the sizes be different?
>
> I'm also curious about increasing MaxDirectMemorySize to something
> larger than 256MB. I tried removing it altogether in my Source JVM
> (which makes the size unbounded), but that didn't seem to make a
> difference.
>
> I'm having some trouble figuring out where the backup is happening, and
> how to open up the gates. :)
>
> Thanks in advance for any suggestions.
> Chris
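P.S. For anyone hitting the same exception later: if I'm reading the MemoryChannel code right, the timeout I bumped corresponds to the channel's keep-alive property, so the channel section of my source-side config now looks roughly like this (the agent and channel names here are placeholders, not my real ones):

    agent1.channels.memChannel.type = memory
    agent1.channels.memChannel.capacity = 1000000
    agent1.channels.memChannel.transactionCapacity = 1000
    # keep-alive: seconds a put/take will block waiting for channel
    # space/events before the transaction fails (default is 3)
    agent1.channels.memChannel.keep-alive = 10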