Re: What does HDFSSink batch size actually effect?

Juhani Connolly Tue, 14 May 2013 19:44:15 -0700

HDFS batch size determines the number of events to take from the channel and send in one go.

These will be split up into multiple files if bucketed, which is worth consideration (how many events will get written to each file? If it's only a handful, a higher batch size or fewer files may be desirable).

The size from hdfs -ls will display as 0, but if you actually download the file it should contain everything. Each batch invokes a sync() operation on every bucket writer. I'm not entirely sure how not having append activated might affect this.
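
To make the bucketing point concrete, here is a rough sketch in Python. It is not Flume's code, only a simplified model of the behaviour described above: one batch is taken from the channel, split by bucket, and each bucket that received events gets one sync. The function name drain_batch, the event tuples and the host names are all made up for illustration.

```python
from collections import defaultdict


def drain_batch(channel, batch_size, bucket_of):
    """Remove up to batch_size events from the channel and group them by bucket."""
    batch = channel[:batch_size]
    del channel[:batch_size]
    buckets = defaultdict(list)
    for event in batch:
        buckets[bucket_of(event)].append(event)
    return dict(buckets)


# Hypothetical data: 12 events from 3 hosts, bucketed by host
# (similar in spirit to a %{host} escape in hdfs.path).
channel = [("host%d" % (i % 3), "event%d" % i) for i in range(12)]

written = drain_batch(channel, batch_size=10, bucket_of=lambda e: e[0])

for bucket, events in sorted(written.items()):
    # In this model, every bucket touched by the batch gets one sync.
    print(bucket, len(events), "events, then one sync")

print("left in channel:", len(channel))
```

With a batch size of 10 and three buckets, each file only receives three or four events per batch, which is the "only a handful" case where a larger batch size or fewer buckets may be worth trying.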


On 05/15/2013 03:26 AM, Gary Malouf wrote:

I've previously posted something similar to this on StackOverflow: http://stackoverflow.com/questions/16548358/how-come-flume-ng-hdfs-sink-does-not-write-to-file-when-the-number-of-events-equ
My understanding of batch size from looking at the code in flume-ng 1.3.x is that batch size determines at what point data is written to HDFS. With my configuration below, I am not seeing any data written to file until the rollInterval has passed.
imp-agent.channels.imp-ch1.type = memory
imp-agent.channels.imp-ch1.capacity = 40000
imp-agent.channels.imp-ch1.transactionCapacity = 1000

imp-agent.sources.avro-imp-source1.channels = imp-ch1
imp-agent.sources.avro-imp-source1.type = avro
imp-agent.sources.avro-imp-source1.bind = 0.0.0.0
imp-agent.sources.avro-imp-source1.port = 41414

imp-agent.sources.avro-imp-source1.interceptors = host1 timestamp1
imp-agent.sources.avro-imp-source1.interceptors.host1.type = host
imp-agent.sources.avro-imp-source1.interceptors.host1.useIP = false
imp-agent.sources.avro-imp-source1.interceptors.timestamp1.type = timestamp

imp-agent.sinks.hdfs-imp-sink1.channel = imp-ch1
imp-agent.sinks.hdfs-imp-sink1.type = hdfs
imp-agent.sinks.hdfs-imp-sink1.hdfs.path = hdfs://mynamenode:8020/flume/impressions/yr=%Y/mo=%m/d=%d/logger=%{host}s1/
imp-agent.sinks.hdfs-imp-sink1.hdfs.filePrefix = Impr
imp-agent.sinks.hdfs-imp-sink1.hdfs.batchSize = 10
imp-agent.sinks.hdfs-imp-sink1.hdfs.rollInterval = 3600
imp-agent.sinks.hdfs-imp-sink1.hdfs.rollCount = 0
imp-agent.sinks.hdfs-imp-sink1.hdfs.rollSize = 66584576

imp-agent.channels = imp-ch1
imp-agent.sources = avro-imp-source1
imp-agent.sinks = hdfs-imp-sink1
I bring this up because I want to know that once 'batchSize' messages have been sent to Flume, they have been put into HDFS rather than waiting for the log roll time to do all of the writing. My strong preference, if possible, is to make sure that data is being written to the '.tmp' file throughout the hour and then rolled after the 'rollInterval' amount of time has passed.
