On Wed, Apr 9, 2014 at 12:54 PM, Pritchard, Charles X. -ND <[email protected]> wrote:
> On Apr 9, 2014, at 8:06 AM, Brock Noland <[email protected]> wrote:
>
> Hi Charles,
>
>> Exploring the idea of using "append" instead of creating new files
>> with HDFS every few minutes.
> ...
> it's possible the client would write a partial line without a newline.
> Then the client on restart would append to that existing line. The
> subsequent line would be correctly formatted.
>
> Is this an issue with Hadoop architecture, or an issue with the way
> Flume calls (or does not call) some kind of fsync/sync interface?
>
> Hadoop has append but there's no merge -- it'd be wonderful to just
> write data, then atomically call "merge this". Never a corrupt file!
>
> Having a partially appended record would have the unfortunate
> consequence of causing fastidious MR jobs to throw errors on occasion.

"Atomic Record Append" is a feature gap between GFS and HDFS. AFAIK
there is nothing in HDFS that precludes implementing the feature. As
with most items in the storage layer, it's a sizable amount of
implementation work.

> On Tue, Apr 8, 2014 at 9:00 PM, Christopher Shannon
> <[email protected]> wrote:
>
>> Not sure what you are trying to do, but the HDFS sink appends. It's
>> just that you have to determine what your roll-over strategy will be.
>> Instead of rolling every few minutes, you can set hdfs.rollInterval=0
>> (which disables it) and set hdfs.rollSize to however large you want
>> your files to grow before you roll over to appending to a new file.
>> You can also use hdfs.rollCount to roll over after a certain number
>> of records. I use rollSize for my roll-over strategy.
>
> Sounds like a good strategy. Do you also access those HDFS files while
> they're still being written to -- that is, do you hit the edge case
> that Brock brought up?
>
> -Charles
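For concreteness, here is a minimal sketch of the size-only roll policy
Christopher describes, using the hdfs.rollInterval / hdfs.rollSize /
hdfs.rollCount properties named in the thread. The agent, channel, and
sink names, plus the path and size values, are made up for illustration:

    # Roll purely on size; "agent1", "ch1", and "sink1" are hypothetical names.
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.channel = ch1
    agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events
    # Disable time- and count-based rolling so only size triggers a roll.
    agent1.sinks.sink1.hdfs.rollInterval = 0
    agent1.sinks.sink1.hdfs.rollCount = 0
    # Roll once the open file reaches roughly 128 MB (value is in bytes).
    agent1.sinks.sink1.hdfs.rollSize = 134217728

With rollInterval and rollCount zeroed out, only rollSize triggers a
roll, so file sizes stay predictable regardless of event rate.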

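On the append side, here is a rough sketch of what a bare client append
looks like against the HDFS Java API. This is not Flume's sink code,
and the path and payload are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Hypothetical file left behind by an earlier writer.
            Path path = new Path("/flume/events/events.log");

            // append() resumes exactly where the previous writer stopped --
            // including mid-line if the last write ended without a newline,
            // which is the partial-record case Brock describes.
            try (FSDataOutputStream out = fs.append(path)) {
                out.writeBytes("next event\n");
                // hflush() pushes buffered bytes to the datanodes so new
                // readers can see them; hsync() additionally asks for a
                // flush to disk. Neither is an atomic record append.
                out.hflush();
            }
        }
    }

Neither flush call changes the failure mode discussed above: if the
writer dies between writes, the last line can still land without its
trailing newline.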