Hello,

I'd like to better understand the storm-hdfs bolt, specifically the Trident
state implementation in the external/storm-hdfs project.

First, can anyone confirm that HdfsState is not transactional? That is,
exactly-once write semantics do not apply?

In a Trident topology that uses this HDFS bolt, when the worker task that
is writing to HDFS fails in the middle of a write, what state is the file
left in? The file hasn't been closed, so the data block being written
might not have been flushed to disk on the datanode despite the sync()
call. Will that data block become readable at all after a small time
window, given that the file is now orphaned (see next question)?

In the above failure, I noticed that the Supervisor fired up another
worker, which started writing to a new file in the same directory. This
means the data file from the previous task is now orphaned (so it won't
be rotated, etc.). Is that understanding correct? How have others dealt
with such orphaned files?
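To frame what I mean by "dealing with" them, here is the kind of cleanup logic I'm considering, as a minimal sketch (illustrative Python, not storm-hdfs code; the directory listing, active-file set, and idle threshold are all my own assumptions — a real version would list the directory via the HDFS FileSystem API or WebHDFS):

```python
def find_orphaned(listing, active_files, now, max_idle_secs=3600):
    """Pick out files that look orphaned: not among the files currently
    open by live writer tasks, and not modified within a grace period.

    listing      : iterable of (path, mtime_epoch_secs) pairs
    active_files : set of paths still owned by live tasks
    now          : current time in epoch seconds
    """
    orphans = []
    for path, mtime in listing:
        if path in active_files:
            continue  # still being written by a live task
        if now - mtime > max_idle_secs:
            orphans.append(path)  # stale and unowned -> candidate for cleanup
    return orphans
```

A periodic job could then move such files aside (or close/rename them) so they don't sit unrotated forever — assuming the grace period is comfortably longer than the rotation interval.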

Can anyone provide an example of a failure that would leave the
destination HDFS file corrupted by an incomplete write? For instance, I
suppose the out.write() call could hit a network issue such that only a
portion of a message made it through.
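To make the concern concrete: if a write can be cut off mid-message, I assume downstream readers would need to tolerate a truncated trailing record. A sketch of that defensive read, in illustrative Python (the real bolt writes via a Java output stream; the newline-delimited format is just an example):

```python
def complete_records(raw: bytes, delimiter: bytes = b"\n"):
    """Return only the records terminated by the delimiter; a partial
    trailing record (e.g. from a write cut off mid-message) is dropped."""
    parts = raw.split(delimiter)
    # split() leaves the unterminated tail (possibly empty) as the last
    # element; everything before it ended with a delimiter and is complete.
    return parts[:-1]
```

Is this kind of reader-side tolerance the expected way to handle it, or does the bolt guarantee record-level atomicity that makes it unnecessary?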

On another note, why do I see the following sequence of messages in the
worker log long after the data is written to HDFS? "Processing received
message source..." -> "Emitting ... __ack_ack..." Presumably these are
keep-alive log messages for the topology, in which case can the frequency
of this logging be controlled or tuned?
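For reference, this is what I was considering on the logging side — an override in the worker's logback configuration, assuming those messages come from Storm's executor (the logger name is my guess based on the log lines, so it may well be wrong):

```xml
<!-- logback.xml fragment: quiet the per-tuple executor messages -->
<logger name="backtype.storm.daemon.executor" level="WARN"/>
```

Is that the right knob, or are these messages governed by a topology-level setting instead?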

Thanks
Ranga
