Hello, I'd like to better understand the Storm-HDFS bolt, specifically the Trident implementation in the external/storm-hdfs project.
First, can anyone confirm that HdfsState is not transactional, i.e. that exactly-once write semantics do not apply?

1. In a Trident topology that uses HdfsState, if the worker task writing to HDFS fails in the middle of a write, what state is the file left in? The file has not been closed, so the data block being written may not have been flushed to disk on the datanode despite the sync() call. Will that block become readable at all after some time window, given that the file is now orphaned (see the next question)?

2. After the failure above, I noticed that the Supervisor fired up another worker, which started writing to a new file in the same directory. The data file of the previous task is therefore orphaned (e.g. it will never be rotated). Is that understanding correct? How have others dealt with such orphaned files?

3. Can anyone provide an example of a failure that would leave the destination HDFS file corrupted due to an incomplete write? For instance, I suppose the out.write() call could hit a network issue such that only a portion of a message goes through.

4. On another note, why do I see the following sequence of messages in the worker log long after the data has been written to HDFS? "Processing received message source..." -> "Emitting ... __ack_ack..." Presumably these are keep-alive log messages for the topology; if so, can the logging frequency be controlled or tuned?

Thanks,
Ranga
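P.S. To make the corruption question concrete, here is a self-contained simulation of the kind of partial write I have in mind. It uses plain java.io against the local filesystem rather than HDFS, and the record format and the simulateCrash helper are made up purely for illustration: the writer emits one complete delimited record, then "crashes" after only part of the second record's bytes have gone out.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class PartialWriteDemo {
    // Writes one complete delimited record, then simulates a crash midway
    // through the second: part of the bytes (and no trailing newline) are
    // left in the file, just as a failed out.write() might leave them.
    static List<String> simulateCrash() throws IOException {
        Path p = Files.createTempFile("hdfs-sim", ".txt");
        try (OutputStream out = Files.newOutputStream(p)) {
            out.write("id=1|value=complete\n".getBytes(StandardCharsets.UTF_8));
            // "Crash" here: the rest of the second record is never written.
            out.write("id=2|value=tru".getBytes(StandardCharsets.UTF_8));
        }
        // What a downstream reader of the file would now see.
        return Files.readAllLines(p);
    }

    public static void main(String[] args) throws IOException {
        for (String line : simulateCrash()) {
            System.out.println(line);
        }
    }
}
```

A reader of this file sees a well-formed first record followed by a truncated final record, which is exactly the corruption I am asking about.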
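P.P.S. Regarding the orphaned files: would something like the following be a reasonable approach? It is only a sketch of a periodic cleanup job I am considering, run here against a local temp directory rather than HDFS; the OrphanSweep class, the quarantine directory, and the idle threshold are all hypothetical, not anything from storm-hdfs.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class OrphanSweep {
    // Hypothetical cleanup: treat any regular file that has not been
    // modified for maxIdleMillis as abandoned by a dead writer and move
    // it into a quarantine directory for later inspection/merging.
    static int sweep(Path dir, Path quarantine, long maxIdleMillis) throws IOException {
        Files.createDirectories(quarantine);
        int moved = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path f : files) {
                long idle = System.currentTimeMillis()
                        - Files.getLastModifiedTime(f).toMillis();
                if (Files.isRegularFile(f) && idle > maxIdleMillis) {
                    Files.move(f, quarantine.resolve(f.getFileName()));
                    moved++;
                }
            }
        }
        return moved;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("hdfs-out");
        Path quarantine = dir.resolve("quarantine");
        // One stale file (last touched 60s ago) and one fresh file.
        Path stale = Files.createFile(dir.resolve("stale.txt"));
        Files.setLastModifiedTime(stale,
                java.nio.file.attribute.FileTime.fromMillis(
                        System.currentTimeMillis() - 60_000));
        Files.createFile(dir.resolve("fresh.txt"));
        System.out.println("moved=" + sweep(dir, quarantine, 30_000));
    }
}
```

The obvious caveat is picking an idle threshold safely longer than any legitimate pause between writes, which is why I am asking how others have handled this.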
