Hi Roberto,

Setting the roll intervals to 0 will stop the sink rolling the files in HDFS. 
Try setting hdfs.rollCount to the number of messages you want to roll the file 
on (i.e. the number of messages per file). Bear in mind that setting this low 
will result in higher HDFS overhead.
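
As a rough sketch of what I mean, using the agent and sink names from your 
message. The value 100 and the rollSize line are assumptions on my part and 
not something I have tested against your setup:

```properties
# Roll to a new file after 100 events (assumed target, matching your batchSize)
my_agent.sinks.my_sink.hdfs.rollCount = 100
# 0 disables time-based rolling
my_agent.sinks.my_sink.hdfs.rollInterval = 0
# Assumption: hdfs.rollSize is not set in your config, so its default of
# 1024 bytes would still apply; 0 disables size-based rolling
my_agent.sinks.my_sink.hdfs.rollSize = 0
```

If your messages are larger than roughly 1 KB, the default hdfs.rollSize may 
be what closes each file after a single message, so it is worth ruling out.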


--
Chris Horrocks


On Wed, Nov 16, 2016 at 10:35 am, Roberto Coluccio <'[email protected]'> 
wrote:

Hello folks,

I'm testing a Flume agent with the following topology:

JMS source (Tibco implementation) -> memory channel -> hdfs sink

The JMS source has:

- my_agent.sources.my_source.batchSize = 100

The memory channel has:

- my_agent.channels.my_channel.capacity = 100

The HDFS sink has:

- my_agent.sinks.my_sink.hdfs.batchSize = 100
- my_agent.sinks.my_sink.hdfs.rollCount = 0
- my_agent.sinks.my_sink.hdfs.rollInterval = 0
- my_agent.sinks.my_sink.hdfs.idleTimeout = 0

I don't understand how/why new files on HDFS are created/closed. In fact, when 
I:

1. launch the agent (JMS queue empty)
2. push a new text message on the JMS queue

a new file is created by the HDFS sink, but not yet closed (as I 
expect). BUT, when I

3. push another text message on the JMS queue

regardless of how long I wait before performing step 3, the HDFS sink closes 
the previously opened file, then opens a new one for the incoming message 
consumed from the queue and processed through the channel.

This way, files will always have 1 and only 1 message inside them. I was 
expecting that number to be 100, according to the configuration mentioned above.

Any hints?

Best regards,

Roberto
