Hi,

I have some trouble setting up the HDFS sink in Flume-NG (CDH3U4, Flume 1.1.0):

Here's my sink configuration:

agent.sinks.hdfsSinkSMP.type = hdfs
agent.sinks.hdfsSinkSMP.channel = memoryChannel
agent.sinks.hdfsSinkSMP.hdfs.filePrefix = flumenode1
agent.sinks.hdfsSinkSMP.hdfs.fileType = SequenceFile
agent.sinks.hdfsSinkSMP.hdfs.codeC = gzip
agent.sinks.hdfsSinkSMP.hdfs.rollCount = 0
agent.sinks.hdfsSinkSMP.hdfs.batchSize = 1
agent.sinks.hdfsSinkSMP.hdfs.rollInterval = 15
agent.sinks.hdfsSinkSMP.hdfs.rollSize = 0
agent.sinks.hdfsSinkSMP.hdfs.path = hdfs://namenode/user/hive/warehouse/someDatabase.db/someTable/%Y-%m-%d/%H00/%M/somePartion
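
For comparison, here is a sketch of the same sink with two extra settings that address exactly this symptom. Both are assumptions about your environment: hdfs.idleTimeout and hdfs.inUsePrefix exist in Flume NG releases newer than 1.1.0, so check your version's user guide before relying on them. idleTimeout closes a bucket's file after N seconds without new events (rollInterval alone only rolls files that are still receiving writes), and an in-use prefix of "_" makes Hive/MapReduce's default FileInputFormat skip files that are still being written.

```properties
# Hypothetical sketch -- hdfs.idleTimeout and hdfs.inUsePrefix were added to
# Flume NG after 1.1.0; verify they exist in your release before using them.
agent.sinks.hdfsSinkSMP.type = hdfs
agent.sinks.hdfsSinkSMP.channel = memoryChannel
agent.sinks.hdfsSinkSMP.hdfs.path = hdfs://namenode/user/hive/warehouse/someDatabase.db/someTable/%Y-%m-%d/%H00/%M/somePartion
agent.sinks.hdfsSinkSMP.hdfs.filePrefix = flumenode1
agent.sinks.hdfsSinkSMP.hdfs.fileType = SequenceFile
agent.sinks.hdfsSinkSMP.hdfs.codeC = gzip
# Roll on time only, as before
agent.sinks.hdfsSinkSMP.hdfs.rollCount = 0
agent.sinks.hdfsSinkSMP.hdfs.rollSize = 0
agent.sinks.hdfsSinkSMP.hdfs.rollInterval = 15
agent.sinks.hdfsSinkSMP.hdfs.batchSize = 1
# Close a bucket's file after 30 s with no new events, so partitions that stop
# receiving data do not leave .tmp files open indefinitely
agent.sinks.hdfsSinkSMP.hdfs.idleTimeout = 30
# A leading underscore makes Hive/MapReduce ignore in-progress files
agent.sinks.hdfsSinkSMP.hdfs.inUsePrefix = _
```

If upgrading is not an option, the .tmp files left behind by a stopped agent can only be renamed by hand, since Flume 1.1.0 does not clean them up on shutdown.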

Events are generated by a SyslogTcp source. We write the data into Hive 
partitions. This works, but it keeps a lot of .tmp files open. I disabled 
event-count- and size-based file rolling and enabled only interval-based 
rolling, so that files are closed after 15 seconds. But Flume keeps files open 
much longer than 15 seconds (sometimes for hours, sometimes never closing them). 
Stopping Flume also leaves .tmp files in those directories. Sometimes it opens 
new files in partitions without having any data for them. Maybe I'm doing the 
file rolling completely wrong?

Some Hive jobs use data that is 5 minutes old, but if Flume renames a file 
after the job starts, the job fails. That's why I want to close the files after 
15 seconds. New files are not a problem.

Does anyone have an idea?

Best regards,
Christian
