Hi Charles,

> Exploring the idea of using "append" instead of creating new files with
> HDFS every few minutes.

I wonder if this is doable by setting rollCount to 0 and then using
rollInterval (or alternatively rollSize)?

> There's certainly a history of append with HDFS, mainly, earlier
> versions of Hadoop warn strongly against using file append semantics.

Correct, HDFS 1 append did not work and would result in corrupt data. Many
users have been using append in HDFS 2 for some time. The only consideration
with append is that in certain scenarios a small portion of the file can be
corrupted. For example, when writing to a text file, it's possible the client
would write a partial line without a newline. On restart, the client would
append to that existing partial line; subsequent lines would be correctly
formatted.

Cheers!
Brock

On Tue, Apr 8, 2014 at 9:00 PM, Christopher Shannon <[email protected]> wrote:

> Not sure what you are trying to do, but the HDFS sink appends. It's just
> that you have to determine what your roll-over strategy will be. Instead of
> rolling every few minutes, you can set hdfs.rollInterval=0 (which disables
> it) and set hdfs.rollSize to however large you want your files to grow
> before you roll over to a new file. You can also use hdfs.rollCount to roll
> over after a certain number of records. I use rollSize for my roll-over
> strategy.
>
> On Tue, Apr 8, 2014 at 8:35 PM, Pritchard, Charles X. -ND <
> [email protected]> wrote:
>
>> Exploring the idea of using "append" instead of creating new files with
>> HDFS every few minutes.
>> Are there particular design decisions / considerations?
>>
>> There's certainly a history of append with HDFS, mainly, earlier versions
>> of Hadoop warn strongly against using file append semantics.
>>
>> -Charles
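The size-only roll-over strategy Christopher describes might look something like the sketch below in a Flume agent properties file. The `hdfs.rollInterval`, `hdfs.rollSize`, and `hdfs.rollCount` properties are standard HDFS sink settings; the agent, sink, and channel names and the HDFS path are illustrative, not from the thread:

```properties
# Illustrative Flume agent config: roll HDFS files by size only.
# "agent", "k1", "c1", and the hdfs.path value are placeholder names.
agent.sinks.k1.type = hdfs
agent.sinks.k1.channel = c1
agent.sinks.k1.hdfs.path = hdfs://namenode/flume/events
agent.sinks.k1.hdfs.fileType = DataStream

# Disable time-based and event-count-based rolling...
agent.sinks.k1.hdfs.rollInterval = 0
agent.sinks.k1.hdfs.rollCount = 0

# ...so files roll only when they reach this size in bytes (128 MB here).
agent.sinks.k1.hdfs.rollSize = 134217728
```

With `rollInterval` and `rollCount` both set to 0, only `rollSize` triggers a roll, so a new file is started roughly every 128 MB instead of every few minutes.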
