On Apr 9, 2014, at 8:06 AM, Brock Noland <[email protected]> wrote:

Hi Charles,

> Exploring the idea of using “append” instead of creating new files with
> HDFS every few minutes.
...
it's possible the client would write a partial line without a newline. Then the client on restart would append to that existing line. The subsequent line would be correctly formatted.

Is this an issue with Hadoop architecture, or an issue with the way Flume calls (or does not call) some kind of fsync/sync interface? Hadoop has append but no merge; it would be wonderful to just write data and then atomically call “merge this”. Never a corrupt file!

Having a partially appended record would have the unfortunate consequence of causing fastidious MR jobs to throw errors on occasion.

On Tue, Apr 8, 2014 at 9:00 PM, Christopher Shannon <[email protected]> wrote:

Not sure what you are trying to do, but the HDFS sink appends. It's just that you have to determine what your roll-over strategy will be. Instead of rolling every few minutes, you can set hdfs.rollInterval=0 (which disables it) and set hdfs.rollSize to however large you want your files to grow before rolling over to a new file. You can also use hdfs.rollCount to roll over after a certain number of records. I use rollSize for my roll-over strategy.

Sounds like a good strategy. Do you also access those HDFS files while they're still being written to, that is, do you hit the edge case that Brock brought up?

-Charles
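
For reference, a minimal sketch of an agent configuration along the lines Christopher describes. The agent, source, channel, and sink names, the port, and the HDFS path are illustrative placeholders; the hdfs.roll* properties are the ones discussed above.

    # Sketch of a size-based roll-over strategy (names are illustrative)
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = hdfsSink

    # stand-in source for the example
    agent1.sources.src1.type = netcat
    agent1.sources.src1.bind = localhost
    agent1.sources.src1.port = 44444
    agent1.sources.src1.channels = ch1

    agent1.channels.ch1.type = memory
    agent1.channels.ch1.capacity = 10000

    agent1.sinks.hdfsSink.type = hdfs
    agent1.sinks.hdfsSink.channel = ch1
    agent1.sinks.hdfsSink.hdfs.path = /flume/events
    agent1.sinks.hdfsSink.hdfs.fileType = DataStream
    # disable time-based and count-based rolls
    agent1.sinks.hdfsSink.hdfs.rollInterval = 0
    agent1.sinks.hdfsSink.hdfs.rollCount = 0
    # roll when the file reaches roughly 128 MB (value is in bytes)
    agent1.sinks.hdfsSink.hdfs.rollSize = 134217728

Note that hdfs.rollSize is specified in bytes, and that the sink writes in-progress files under an in-use suffix (.tmp by default, configurable via hdfs.inUseSuffix), which jobs can use to skip files that are still being written.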
