On Apr 9, 2014, at 8:06 AM, Brock Noland <[email protected]> wrote:

Hi Charles,

> Exploring the idea of using “append” instead of creating new files with
> HDFS every few minutes.
...
it's possible the client would write a partial line without a newline. Then the 
client, on restart, would append to that existing line, leaving one corrupt record; 
the subsequent line would be correctly formatted.

Is this an issue with the Hadoop architecture, or with the way Flume calls (or does 
not call) some kind of fsync/sync interface?
Hadoop has append but there's no merge; it would be wonderful to just write data and 
then atomically call “merge this”. Never a corrupt file!

A partially appended record would have the unfortunate consequence of causing 
fastidious MR jobs to throw errors on occasion.
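For reference, by sync interface I mean the hooks on the HDFS output stream 
(hflush/hsync). A rough append-and-flush sketch in Java, with a made-up path, and 
assuming the file already exists and append is enabled on the cluster:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Hypothetical path; the file must already exist to append to it.
            Path p = new Path("/flume/events/example.log");
            try (FSDataOutputStream out = fs.append(p)) {
                out.write("one complete record\n".getBytes("UTF-8"));
                // hflush() makes the written data visible to readers;
                // hsync() additionally asks the datanodes to persist it to disk.
                out.hflush();
            }
        }
    }

Even with that, there is nothing like an atomic “merge this” step, which is the gap 
I was getting at.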


On Tue, Apr 8, 2014 at 9:00 PM, Christopher Shannon <[email protected]> wrote:
Not sure what you are trying to do, but the HDFS sink appends. It's just that 
you have to decide on a roll-over strategy. Instead of rolling every few minutes, 
you can set hdfs.rollInterval=0 (which disables time-based rolls) and set 
hdfs.rollSize to however large you want your files to grow before you roll over to 
appending to a new file. You can also use hdfs.rollCount to roll over after a 
certain number of records. I use rollSize for my roll-over strategy.

Sounds like a good strategy. Do you also access those HDFS files while they're 
still being written to? That is, do you hit the edge case that Brock brought up?
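
Just to make sure I follow, the size-only roll-over you describe would look roughly 
like this in the agent config (the agent/sink/channel names and the path below are 
placeholders):

    # Roll only on size (128 MB here); 0 disables time- and count-based rolls.
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
    a1.sinks.k1.hdfs.rollInterval = 0
    a1.sinks.k1.hdfs.rollCount = 0
    a1.sinks.k1.hdfs.rollSize = 134217728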


-Charles
