There would be less contention if you could reduce the sharing... so maybe divide them into 31 per channel. 31 still looks like a huge number. Best if you can consolidate those 31 down to just 1 or 2?
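For illustration only, a four-channel split along those lines might look something like this in a Flume agent properties file. The agent and component names, the tail command, and the Avro sink host/port are all made up for the sketch, not taken from this thread (the `app.sources = ...` / `app.sinks = ...` name lists are omitted for brevity):

```properties
# Hypothetical agent "app": four memory channels, each fed by ~31 of
# the 124 exec sources and drained by its own set of Avro sinks.
app.channels = c1 c2 c3 c4
app.channels.c1.type = memory
app.channels.c1.capacity = 50000
app.channels.c1.transactionCapacity = 1000

# Each source binds to exactly one channel (31 per channel).
app.sources.src1.type = exec
app.sources.src1.command = tail -F /var/log/app1.log
app.sources.src1.channels = c1

# One of the four Avro sinks draining c1 toward the HDFS tier.
app.sinks.k1.type = avro
app.sinks.k1.channel = c1
app.sinks.k1.hostname = hdfs-agent-1.example.com
app.sinks.k1.port = 4141
app.sinks.k1.batch-size = 1000
```

With one channel per group of sources, each source/sink pair contends only with the ~35 other threads on its own channel rather than all 188.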
Keep in mind there is one thread per sink and one per source (unless you are spawning more inside your source / sink). A rule of thumb (actually more like guidance) is 2 to 4 threads per core. So keep an eye out for not overloading your box with too many threads.

On Tue, Mar 12, 2013 at 2:55 PM, Chris Neal <[email protected]> wrote:
> So, in a 4 channel setup, would I bind each of the 124 sources to all of the
> 4 channels, or divide them up and put 31 sources on each individual channel?
> :)
>
>
> On Tue, Mar 12, 2013 at 4:40 PM, Chris Neal <[email protected]> wrote:
>>
>> Beautiful. Will try 4 channels in one Agent first.
>> Thanks!
>>
>>
>> On Tue, Mar 12, 2013 at 4:35 PM, Roshan Naik <[email protected]>
>> wrote:
>>>
>>> Even 16 on a single channel might be on the higher side IMHO.
>>>
>>> Try instead splitting into four channels with 4 sinks each... or even
>>> four agents with one channel and 4 sinks each... it will reduce
>>> contention. Be careful to ensure the capacity of each channel is not
>>> too high, since you now have many channels.
>>> -roshan
>>>
>>> On Tue, Mar 12, 2013 at 2:24 PM, Chris Neal <[email protected]> wrote:
>>> > Thanks for the reply. You're definitely on to something with the
>>> > ever-increasing number of sinks. :)
>>> >
>>> > I scaled it back to 16 AvroSinks, and used a
>>> > MemoryChannel.transactionCapacity of 1000 and an AvroSink batch-size
>>> > of 1000. My ExecSource.batchSize is 100 (I chose this smaller number
>>> > because there are so many of them (124); I didn't want tens of
>>> > thousands of events getting dropped on the MemoryChannel at once,
>>> > rather just thousands). With those settings, things are keeping the
>>> > MemoryChannel drained. Finally getting somewhere! :)
>>> >
>>> > Much appreciate the prompt response. If anything else comes to mind,
>>> > please do let me know.
>>> >
>>> > Thanks again.
>>> > Chris
>>> >
>>> >
>>> >
>>> > On Tue, Mar 12, 2013 at 4:12 PM, Roshan Naik <[email protected]>
>>> > wrote:
>>> >>
>>> >> I meant 640,000, not 64,000.
>>> >>
>>> >> On Tue, Mar 12, 2013 at 2:10 PM, Roshan Naik <[email protected]>
>>> >> wrote:
>>> >> > Beyond a certain # of sinks it won't help to add more. My
>>> >> > suspicion is you may have gone way overboard.
>>> >> >
>>> >> > If your sink-side batch size is that large and you have 64 sinks in
>>> >> > the round-robin, it will take a lot of events (64,000) to be pumped
>>> >> > in by the source before the first event can start trickling out
>>> >> > of any sink. Also memory consumption will be quite high: each sink
>>> >> > will open a transaction and hold on to 10000 events. This is the
>>> >> > cause of the Memory channel filling up. Until the sink-side
>>> >> > transaction is committed (i.e. 10k events are pulled), the memory
>>> >> > reservation on the channel is not relinquished. So your memory
>>> >> > channel size will have to be really high to support so many sinks,
>>> >> > each with such a big batch size.
>>> >> >
>>> >> > My gut feel is that your source-side batch size is not much of an
>>> >> > issue and can be smaller. Increasing the number of sinks will only
>>> >> > help if the sink is indeed the bottleneck.
>>> >> >
>>> >> > On Tue, Mar 12, 2013 at 1:43 PM, Chris Neal <[email protected]>
>>> >> > wrote:
>>> >> >> Hi all.
>>> >> >>
>>> >> >> I've been working on this for quite some time, and need some
>>> >> >> advice from the experts. I have a two-tiered Flume architecture:
>>> >> >>
>>> >> >> App Tier (all on one server):
>>> >> >> 124 ExecSources -> MemoryChannel -> AvroSinks
>>> >> >>
>>> >> >> HDFS Tier (on two servers):
>>> >> >> AvroSource -> FileChannel -> HDFSSinks
>>> >> >>
>>> >> >> When I run the agents, the HDFS tier is keeping up fine with the
>>> >> >> App Tier.
>>> >> >> Queue sizes stay between 0-10000 (I have a batch size of 10000).
>>> >> >> All is good.
>>> >> >>
>>> >> >> On the App Tier, when I view the JMX data through jconsole, I
>>> >> >> watch the size of the MemoryChannel grow steadily until it reaches
>>> >> >> the max; then it starts throwing exceptions about not being able
>>> >> >> to put the batch on the channel, as expected.
>>> >> >>
>>> >> >> There seem to be two basic ways to increase the throughput of the
>>> >> >> App Tier:
>>> >> >> 1. Increase the MemoryChannel's transactionCapacity and the
>>> >> >> corresponding AvroSink's batch-size. Both are set to 10000 for me.
>>> >> >> 2. Increase the number of AvroSinks to drain the MemoryChannel.
>>> >> >> I'm up to 64 Sinks now, which round-robin between the two Flume
>>> >> >> Agents on the HDFS tier.
>>> >> >>
>>> >> >> Both of those values seem quite high to me (batch size and number
>>> >> >> of sinks).
>>> >> >>
>>> >> >> Am I missing something as far as tuning?
>>> >> >> Which would allow for a greater increase in throughput: more Sinks
>>> >> >> or a larger batch size?
>>> >> >>
>>> >> >> I'm stumped here. I still think I can get this to work. :)
>>> >> >>
>>> >> >> Any suggestions are most welcome.
>>> >> >> Thanks for your time.
>>> >> >> Chris
>>> >> >>
>>> >
>>> >
>>
>>
>
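The memory-reservation reasoning in the thread reduces to simple arithmetic: every sink can hold an uncommitted transaction of up to batch-size events, so the channel must be able to reserve at least sinks × batch-size events at once. A quick back-of-the-envelope check using the numbers from the thread (the helper function name is just for illustration):

```python
def min_channel_capacity(num_sinks: int, sink_batch_size: int) -> int:
    """Lower bound on channel capacity: each sink holds an uncommitted
    transaction of up to sink_batch_size events, and the channel's memory
    reservation is not released until that transaction commits."""
    return num_sinks * sink_batch_size

# Original setup: 64 sinks, 10,000-event batches
print(min_channel_capacity(64, 10_000))  # 640000 events reserved at once

# Scaled-back setup: 16 sinks, 1,000-event batches
print(min_channel_capacity(16, 1_000))   # 16000 events
```

This is why scaling back to 16 sinks with 1000-event batches (a 40x smaller worst-case reservation) let the MemoryChannel stay drained.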
