Hi Charles,

> Exploring the idea of using "append" instead of creating new files with
> HDFS every few minutes.

I wonder if this is doable by setting rollCount to 0 and then using
rollInterval (or alternatively rollSize)?

> There's certainly a history of append with HDFS, mainly, earlier
> versions of Hadoop warn strongly against using file append semantics.

Correct, HDFS 1 append did not work and would result in corrupt data. Many
users have been using append in HDFS 2 for some time. The only consideration
with append is that in certain scenarios a small portion of the file can be
corrupted. For example, when writing to a text file, it's possible the client
would write a partial line without a newline. On restart, the client would
append to that existing partial line; subsequent lines would be correctly
formatted.

Cheers!
Brock

On Tue, Apr 8, 2014 at 9:00 PM, Christopher Shannon <[email protected]> wrote:

> Not sure what you are trying to do, but the HDFS sink appends. It's just
> that you have to determine what your roll-over strategy will be. Instead of
> rolling every few minutes, you can set hdfs.rollInterval=0 (which disables
> it) and set hdfs.rollSize to however large you want your files to grow
> before you roll over to a new file. You can also use hdfs.rollCount to roll
> over after a certain number of records. I use rollSize for my roll-over
> strategy.
>
> On Tue, Apr 8, 2014 at 8:35 PM, Pritchard, Charles X. -ND <
> [email protected]> wrote:
>
>> Exploring the idea of using "append" instead of creating new files with
>> HDFS every few minutes.
>> Are there particular design decisions / considerations?
>>
>> There's certainly a history of append with HDFS, mainly, earlier versions
>> of Hadoop warn strongly against using file append semantics.
>>
>> -Charles
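The size-only roll-over strategy Christopher describes might look something like the sketch below in a Flume agent properties file. The `hdfs.rollInterval`, `hdfs.rollSize`, and `hdfs.rollCount` properties are standard HDFS sink settings; the agent, sink, and channel names and the HDFS path are illustrative, not from the thread:

```properties
# Illustrative Flume agent config: roll HDFS files by size only.
# "agent", "k1", "c1", and the hdfs.path value are placeholder names.
agent.sinks.k1.type = hdfs
agent.sinks.k1.channel = c1
agent.sinks.k1.hdfs.path = hdfs://namenode/flume/events
agent.sinks.k1.hdfs.fileType = DataStream

# Disable time-based and event-count-based rolling...
agent.sinks.k1.hdfs.rollInterval = 0
agent.sinks.k1.hdfs.rollCount = 0

# ...so files roll only when they reach this size in bytes (128 MB here).
agent.sinks.k1.hdfs.rollSize = 134217728
```

With `rollInterval` and `rollCount` both set to 0, only `rollSize` triggers a roll, so a new file is started roughly every 128 MB instead of every few minutes.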
