Hi,

I'm trying to write events to HDFS using Flume 1.2.0 and I have a couple
of questions.

Firstly, about the reliability semantics of the HdfsEventSink.

My number one requirement is reliability, i.e. not losing any events.
Ideally, by the time the HdfsEventSink commits the transaction, all
events should be safely written to HDFS and visible to other clients, so
that no data is lost even if the agent dies after that point. But what
is actually happening in my tests is as follows:

1. The HDFS sink takes some events from the FileChannel and writes them
to a SequenceFile on HDFS
2. The sink commits the transaction, and the FileChannel updates its
checkpoint. As far as FileChannel is concerned, the events have been
safely written to the sink.
3. Kill the agent.

Result: I'm left with a weird ".tmp" file on HDFS that the NameNode
reports as zero bytes even though it contains data. The SequenceFile has
not yet been closed and rolled over, so it still has the ".tmp" suffix.
The data is actually in the HDFS blocks, but because the file was never
closed, the NameNode thinks it has a length of 0 bytes. I'm not sure how
to recover from this.

Is this the expected behaviour of the HDFS sink, or am I doing something
wrong? Do I need to explicitly enable HDFS append? (I am using HDFS
2.0.0-alpha)

I guess the problem is that data is not "safely" written until file
rollover occurs, but the timing of file rollover (by elapsed time, event
count, file size, etc.) is unrelated to the timing of transactions. Is
there any way to keep the two in sync?
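
One partial workaround I've been considering, assuming I've understood
the roll settings correctly: if hdfs.rollCount is set equal to
hdfs.batchSize, then every committed batch should also trigger a roll,
so a file would be closed at (roughly) every transaction boundary.
Something like this (the agent and sink names below are just
placeholders):

```properties
agent.sinks.hdfsSink.type = hdfs
# Roll after exactly one batch worth of events, so each transaction
# commit should coincide with a file close. Time- and size-based rolls
# are disabled so they can't fire mid-batch.
agent.sinks.hdfsSink.hdfs.batchSize = 1000
agent.sinks.hdfsSink.hdfs.rollCount = 1000
agent.sinks.hdfsSink.hdfs.rollInterval = 0
agent.sinks.hdfsSink.hdfs.rollSize = 0
```

But that still produces one file per batch, which brings me back to the
small-files problem below, so I'm not sure it's a real solution.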

Second question: Could somebody please explain the reasoning behind the
default values of the HDFS sink configuration? If I use the defaults,
the sink generates zillions of tiny files (max 10 events per file),
which as I understand it is not a recommended way to use HDFS.

Is it OK to change these settings to generate much larger files (MB, GB
scale)? Or should I write a script that periodically combines these tiny
files into larger ones?
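
For reference, here is the kind of configuration I had in mind for
generating large files: disable the count- and time-based rolls and keep
only a size-based roll. These property names are what I found in the
HDFS sink documentation; please correct me if I've misread them (the
agent/sink names and path are placeholders):

```properties
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events
agent.sinks.hdfsSink.hdfs.fileType = SequenceFile
# Disable time-based and count-based rolls.
agent.sinks.hdfsSink.hdfs.rollInterval = 0
agent.sinks.hdfsSink.hdfs.rollCount = 0
# Roll at roughly 128 MB (value is in bytes).
agent.sinks.hdfsSink.hdfs.rollSize = 134217728
```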

Thanks for any advice,

Chris Birchall.


