We have a similar setup (Flume 1.3) and see the same problems here.
Increasing the batch size did not help much, but setting up multiple
AvroSinks draining the same channel did.
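
For reference, here is a minimal sketch of what several Avro sinks on one File Channel could look like on the collector. The agent name (collector), component names (fc, avro1..avro3), host name, port and directories are placeholders rather than values from our setup or yours, and the source definition is omitted. The property keys are the ones listed in the Flume User Guide for the File Channel and the Avro Sink.

```properties
# Collector agent: one File Channel drained by three Avro sinks in parallel.
# All names, hosts, ports and paths below are placeholders (assumptions).
collector.channels = fc
collector.sinks = avro1 avro2 avro3

collector.channels.fc.type = file
collector.channels.fc.checkpointDir = /var/flume/checkpoint
collector.channels.fc.dataDirs = /var/flume/data

# Each sink takes its own batches from the same channel.
collector.sinks.avro1.type = avro
collector.sinks.avro1.channel = fc
collector.sinks.avro1.hostname = writer1.example.com
collector.sinks.avro1.port = 4545
collector.sinks.avro1.batch-size = 4000

collector.sinks.avro2.type = avro
collector.sinks.avro2.channel = fc
collector.sinks.avro2.hostname = writer1.example.com
collector.sinks.avro2.port = 4545
collector.sinks.avro2.batch-size = 4000

collector.sinks.avro3.type = avro
collector.sinks.avro3.channel = fc
collector.sinks.avro3.hostname = writer1.example.com
collector.sinks.avro3.port = 4545
collector.sinks.avro3.batch-size = 4000
```

As far as I understand, sinks that are not placed in a sink group each get their own runner thread, so the sinks above are deliberately left ungrouped. The right number of sinks is something to find by experiment.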
On 26/7/2013 9:31, Pankaj Gupta wrote:
Hi,
We are trying to figure out how to get better throughput in our flume
pipeline. We have flume instances on a lot of machines writing to a
few collector machines running with a File Channel which in turn write
to still fewer hdfs writer machines running with a File Channel and
HDFS Sinks.
The problem that we're facing is that we are not getting good network
usage between our flume collector machines and hdfs writer machines.
The way these machines are connected is that the filechannel on
collector drains to an Avro Sink which sends to Avro Source on the
writer machine, which in turn writes to a filechannel draining into an
HDFS Sink. So:
[FileChannel -> Avro Sink] -> [Avro Source -> FileChannel -> HDFS Sink]
I did a raw network throughput test (using netcat on the command line)
between the collector and the writer and saw a throughput of
~*200 Megabits*/sec. In contrast, the network throughput (which I observed
using iftop) between collector avro sink and writer avro source never
went over *25 Megabits*/sec, even when the filechannel on the collector
was quite full with millions of events queued up. We obviously want to
use the network better and I am exploring ways of achieving that. The
batch size we are using on avro sink on the collector is 4000.
I have a few questions about how the Avro Source and Avro Sink work
together, to help me improve the throughput, and would really appreciate
a response:
1. Are the batches from flume source sent to the sink in a pipelined
fashion or is the next batch only sent once an ack for the previous
batch is received?
2. If the batch send is not pipelined then would increasing the
number of sinks draining from the channel help?
The idea behind this is to basically achieve pipelining by having
multiple outstanding requests and thus use network better.
3. If batch size is very large, e.g. 1 million, would the batch only
be sent once that many events have accumulated or is there a time
limit after which whatever events have accumulated are sent? Is
this time limit configurable? (I looked in the Avro Sink
documentation for such a setting:
http://flume.apache.org/FlumeUserGuide.html, but couldn't find
anything, hence asking the question)
4. Does enabling SSL have any significant impact on throughput?
An increase in latency is expected, but does it also affect throughput?
We are using flume 1.4.0.
Thanks in Advance,
Pankaj
--
Pankaj Gupta | Software Engineer
BrightRoll, Inc. | www.brightroll.com