Re: HDFS Sink performance

Roshan Naik Wed, 15 Jul 2015 13:45:41 -0700

Yes, my bad. Been meaning to do it... will try to do it this week.
-roshan

From: Hari Shreedharan <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, July 15, 2015 1:41 PM
To: "[email protected]" <[email protected]>
Subject: Re: HDFS Sink performance

Roshan - how about posting that on the Flume wiki?

Thanks,
Hari

On Wed, Jul 15, 2015 at 1:07 PM, Roshan Naik <[email protected]> wrote:
Lohit,
You may want to search the mailing list for 'Flume perf measurements'. You 
should find the recent measurements I posted.
-roshan

From: lohit <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, July 15, 2015 11:19 AM
To: "[email protected]" <[email protected]>
Subject: Re: HDFS Sink performance

Thanks for the reply, Hari. Multiple sinks make sense, but this would also mean 
there are a lot more files on HDFS. I will try multiple sinks and see how fast 
this can go.
Given that a single HDFS stream can sustain much higher throughput, maybe there 
is a way to have a thread pool for SinkRunner-PollingRunner-DefaultSinkProcessor 
instead of a single thread per sink.

2015-07-15 11:11 GMT-07:00 Hari Shreedharan <[email protected]>:
Hi Lohit,

HDFS sinks (in fact, most sinks) are single-threaded by design. This is meant 
to make writing the sinks easier, but all channels can handle multiple sinks 
reading from them. So to improve throughput, you configure several sinks that 
read off the same channel. Make sure, though, that each sink writes to files 
with different HDFS paths or different file prefixes (otherwise the HDFS client 
API will complain about leases).
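
A minimal sketch of that setup, assuming two sinks draining the same
memoryChannel. The sink names hdfsSink1 and hdfsSink2 and the prefix values are
hypothetical, chosen only for illustration; hdfs.filePrefix is the HDFS sink
property that sets the file-name prefix. The remaining per-sink settings
(codec, roll settings, batch size) would be repeated for each sink.

```properties
# Illustrative only: sink names and prefixes are assumptions, not from the thread.
agent.channels = memoryChannel
agent.sinks = hdfsSink1 hdfsSink2

# First sink: reads from the shared channel, writes files with its own prefix.
agent.sinks.hdfsSink1.type = hdfs
agent.sinks.hdfsSink1.channel = memoryChannel
agent.sinks.hdfsSink1.hdfs.path = /tmp/lohit
agent.sinks.hdfsSink1.hdfs.filePrefix = sink1

# Second sink: same channel and path, but a different prefix so the two
# sinks never try to open the same HDFS file.
agent.sinks.hdfsSink2.type = hdfs
agent.sinks.hdfsSink2.channel = memoryChannel
agent.sinks.hdfsSink2.hdfs.path = /tmp/lohit
agent.sinks.hdfsSink2.hdfs.filePrefix = sink2
```

Each sink runs in its own thread, so throughput should scale with the number of
sinks up to whatever the channel and HDFS can sustain, at the cost of more
files on HDFS.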

Thanks,
Hari

On Wed, Jul 15, 2015 at 9:10 AM, lohit <[email protected]> wrote:
Hello,

Does anyone have numbers they can share on HDFS sink performance? In our 
testing, a single sink writing to HDFS (CompressedStream) and reading from a 
MemoryChannel can only do about 35000 events per second (each event is about 
1K in size). After compression this comes to a ~10MB/s write stream to the 
HDFS file, which is pretty low. Our configuration looks like this:

agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.channel = memoryChannel
agent.sinks.hdfsSink.hdfs.path = /tmp/lohit
agent.sinks.hdfsSink.hdfs.codeC = lzo
agent.sinks.hdfsSink.hdfs.fileType = CompressedStream
agent.sinks.hdfsSink.hdfs.writeFormat = Writable
agent.sinks.hdfsSink.hdfs.rollInterval = 3600
agent.sinks.hdfsSink.hdfs.rollSize = 1073741824
agent.sinks.hdfsSink.hdfs.rollCount = 0
agent.sinks.hdfsSink.hdfs.batchSize = 10000
agent.sinks.hdfsSink.hdfs.txnEventMax = 10000

agent.channels.memoryChannel.type = memory

agent.channels.memoryChannel.capacity = 3000000
agent.channels.memoryChannel.transactionCapacity = 10000

--
Have a Nice Day!
Lohit

--
Have a Nice Day!
Lohit
