Yes.. My bad.. Been meaning to do it... will try to do it this week. -roshan
From: Hari Shreedharan <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, July 15, 2015 1:41 PM
To: "[email protected]" <[email protected]>
Subject: Re: HDFS Sink performance

Roshan - how about posting that on the Flume wiki?

Thanks,
Hari

On Wed, Jul 15, 2015 at 1:07 PM, Roshan Naik <[email protected]> wrote:

Lohit,
You may want to search the mailing list for 'Flume perf measurements'. You should find the recent measurements I posted.
-roshan

From: lohit <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, July 15, 2015 11:19 AM
To: "[email protected]" <[email protected]>
Subject: Re: HDFS Sink performance

Thanks for the reply Hari. Multiple sinks make sense, but this would also mean there are a lot more files on HDFS. I will try multiple sinks and see how fast this can go. Given that a single HDFS stream can do much higher throughput, maybe there is a way to have a thread pool for SinkRunner-PollingRunner-DefaultSinkProcessor instead of a single thread per sink.

2015-07-15 11:11 GMT-07:00 Hari Shreedharan <[email protected]>:

Hi Lohit,
HDFS sinks (in fact, most sinks) are single-threaded by design. This is meant to make writing the sinks easier, but all channels can handle multiple sinks reading from them. So to improve throughput, you basically configure several sinks which read off the same channel. Make sure, though, that each sink writes to files with different HDFS paths or different file prefixes (else the HDFS client API will complain about leases).
Thanks,
Hari

On Wed, Jul 15, 2015 at 9:10 AM, lohit <[email protected]> wrote:

Hello,
Does anyone have some numbers which they can share around HDFS sink performance? From our testing, a single sink writing to HDFS (CompressedStream) and reading from a MemoryChannel can only do about 35,000 events per second (each event is about 1K in size). After compression this turns out to be a ~10MB/s write stream to the HDFS file, which is pretty low. Our configuration looks like this:

agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.channel = memoryChannel
agent.sinks.hdfsSink.hdfs.path = /tmp/lohit
agent.sinks.hdfsSink.hdfs.codeC = lzo
agent.sinks.hdfsSink.hdfs.fileType = CompressedStream
agent.sinks.hdfsSink.hdfs.writeFormat = Writable
agent.sinks.hdfsSink.hdfs.rollInterval = 3600
agent.sinks.hdfsSink.hdfs.rollSize = 1073741824
agent.sinks.hdfsSink.hdfs.rollCount = 0
agent.sinks.hdfsSink.hdfs.batchSize = 10000
agent.sinks.hdfsSink.hdfs.txnEventMax = 10000
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 3000000
agent.channels.memoryChannel.transactionCapacity = 10000

--
Have a Nice Day!
Lohit
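
For reference, Hari's suggestion of several sinks reading off the same channel might look roughly like the fragment below. This is only a sketch and was not posted in the thread: the sink names hdfsSink1 and hdfsSink2 and the filePrefix values are made up for illustration, and the remaining hdfs.* settings from the original configuration would need to be repeated for each sink. It relies on the HDFS sink's hdfs.filePrefix property to keep the two sinks from writing to the same file.

```properties
# Sketch only: two HDFS sinks draining the same memory channel.
# Sink names and prefix values below are hypothetical.
agent.channels = memoryChannel
agent.sinks = hdfsSink1 hdfsSink2

agent.sinks.hdfsSink1.type = hdfs
agent.sinks.hdfsSink1.channel = memoryChannel
agent.sinks.hdfsSink1.hdfs.path = /tmp/lohit
# A distinct prefix per sink avoids HDFS lease conflicts on the same file.
agent.sinks.hdfsSink1.hdfs.filePrefix = sink1

agent.sinks.hdfsSink2.type = hdfs
agent.sinks.hdfsSink2.channel = memoryChannel
agent.sinks.hdfsSink2.hdfs.path = /tmp/lohit
agent.sinks.hdfsSink2.hdfs.filePrefix = sink2
```

Each sink runs on its own thread, so the number of open files on HDFS grows with the number of sinks, which is the trade-off Lohit raises in his reply.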
