Pal,

You can add more sinks to your config. Don't put them in a sink group; just have multiple sinks pulling from the same channel. This should increase your throughput.
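For example, here is a minimal sketch based on the config you pasted (the sink name s2 and its filePrefix are just illustrative; give each sink a distinct filePrefix so the writers don't step on each other's file names in the same directory):

agent.sinks = s1 s2

# second HDFS sink draining the same channel as s1
agent.sinks.s2.type = hdfs
agent.sinks.s2.hdfs.path = hdfs://namenode/flume/events/%D/%H/%M
agent.sinks.s2.hdfs.filePrefix = flume-events-2
agent.sinks.s2.hdfs.fileSuffix = .log
agent.sinks.s2.hdfs.fileType = DataStream
# ...plus the same round/roll/batch settings you already use for s1...
agent.sinks.s2.channel = memoryChannel1

Each sink runs on its own runner thread, whereas the sinks in a sink group share one, so independent sinks give you parallel HDFS writers draining memoryChannel1.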
Best,
Jeff

On Mon, Oct 20, 2014 at 3:49 AM, Pal Konyves <[email protected]> wrote:
> Hi there,
>
> We would like to write lots of logs to HDFS via Flume; you can think
> of it as a stress test, or a max-throughput test. I attached our Flume
> config below. The problem is, HDFS writes are freakin' slow. Flume
> writes to HDFS at ~5-10 Mbit/s (for us, that means a couple thousand
> events per sec) on a 1 Gbit network.
>
> 'hadoop dfs -put' is fine; I checked the bandwidth usage with iftop.
> It was ~200 Mbit/s, and a 1 GB file is uploaded within seconds, so
> HDFS itself should not be slow.
>
> Flume is able to receive the messages at around the same speed, so
> the Avro source we use is not the issue; I can see the memory channel
> filling up.
>
> On the other hand, in iftop I can see that while Flume receives the
> events fast, it writes to the datanode at only 5-10 Mbit/s, very,
> very slow. Why is that? I tried huge batch sizes for the HDFS sink
> (10,000-100,000 events), because batch writes are supposedly always
> faster, but the sink only writes ~2,000-3,000 events per sec
> according to the JMX console. Smaller batch sizes (1,000) are no
> faster.
>
> I could not find the magic configuration that makes the HDFS sink
> write faster, so I think there is something generally wrong in Flume.
> I even tried the PseudoTxnMemoryChannel to disable transactions,
> without any improvement in write performance.
>
> Setup:
> I have a setup of 5 physical machines, strong ones (Dell T5600,
> 6-core Xeon, Gigabit networking):
> - Flume Avro client generating log messages
> - Flume agent
> - HDFS namenode
> - HDFS datanode1
> - HDFS datanode2
>
> +++++++++++++++++++++++++++++
> Configuration: http://pastebin.com/53DGd3wm
>
> agent.sources = r1
> agent.channels = memoryChannel1
> agent.sinks = s1
>
> ###############
> # source
> ##############
>
> # For each one of the sources, the type is defined
> agent.sources.r1.type = avro
> agent.sources.r1.threads = 10
> #agent.sources.r1.compression-type = deflate
>
> # The channel can be defined as follows.
> agent.sources.r1.channels = memoryChannel1
>
> # Avro source network configuration
> agent.sources.r1.bind = 0.0.0.0
> agent.sources.r1.port = 50414
>
> #############
> # sink
> ############
> # Each sink's type must be defined
> agent.sinks.s1.type = hdfs
> agent.sinks.s1.hdfs.path = hdfs://namenode/flume/events/%D/%H/%M
> agent.sinks.s1.hdfs.filePrefix = flume-events
> agent.sinks.s1.hdfs.fileSuffix = .log
> agent.sinks.s1.hdfs.fileType = DataStream
>
> # round the timestamp down to every 15 minutes
> agent.sinks.s1.hdfs.round = true
> agent.sinks.s1.hdfs.roundValue = 15
> agent.sinks.s1.hdfs.roundUnit = minute
> agent.sinks.s1.hdfs.timeZone = UTC
>
> # never roll based on file size
> agent.sinks.s1.hdfs.rollSize = 0
> # never roll based on event count
> agent.sinks.s1.hdfs.rollCount = 0
> # roll every 1 minute
> agent.sinks.s1.hdfs.rollInterval = 60
> agent.sinks.s1.hdfs.threadsPoolSize = 10
> agent.sinks.s1.hdfs.rollTimerPoolSize = 2
>
> # events written to the file before it is flushed to HDFS
> agent.sinks.s1.hdfs.batchSize = 20000
> # Specify the channel the sink should use
> agent.sinks.s1.channel = memoryChannel1
>
> ##############
> # channel
> #############
>
> # Each channel's type is defined.
> # agent.channels.memoryChannel1.type = org.apache.flume.channel.PseudoTxnMemoryChannel
> agent.channels.memoryChannel1.type = memory
>
> # Other config values specific to each type of channel (sink or source)
> # can be defined as well.
> # In this case, it specifies the capacity of the memory channel.
> agent.channels.memoryChannel1.capacity = 500000
> agent.channels.memoryChannel1.transactionCapacity = 200000
> agent.channels.memoryChannel1.byteCapacity = 1000000000
