On Wed, Apr 9, 2014 at 12:54 PM, Pritchard, Charles X. -ND <[email protected]> wrote:
> On Apr 9, 2014, at 8:06 AM, Brock Noland <[email protected]> wrote:
>
> Hi Charles,
>
>> Exploring the idea of using "append" instead of creating new files
>> with HDFS every few minutes.
> ...
> it's possible the client would write a partial line without a newline.
> Then the client on restart would append to that existing line. The
> subsequent line would be correctly formatted.
>
> Is this an issue with Hadoop architecture, or an issue with the way
> Flume calls (or does not call) some kind of fsync/sync interface?
>
> Hadoop has append but there's no merge -- it'd be wonderful to just
> write data, then atomically call "merge this". Never a corrupt file!
>
> Having a partially appended record would have the unfortunate
> consequence of causing fastidious MR jobs to throw errors on occasion.

"Atomic Record Append" is a feature gap between GFS and HDFS. AFAIK
there is nothing in HDFS that precludes implementing the feature. As
with most items in the storage layer, it's a sizable amount of
implementation work.

> On Tue, Apr 8, 2014 at 9:00 PM, Christopher Shannon
> <[email protected]> wrote:
>
>> Not sure what you are trying to do, but the HDFS sink appends. It's
>> just that you have to determine what your roll-over strategy will be.
>> Instead of rolling every few minutes, you can set hdfs.rollInterval=0
>> (which disables it) and set hdfs.rollSize to however large you want
>> your files to grow before you roll over to appending to a new file.
>> You can also use hdfs.rollCount to roll over after a certain number
>> of records. I use rollSize for my roll-over strategy.
>
> Sounds like a good strategy. Do you also access those HDFS files while
> they're still being written to -- that is, do you hit the edge case
> that Brock brought up?
>
> -Charles
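For concreteness, here is a minimal sketch of the size-only roll policy
Christopher describes, using the hdfs.rollInterval / hdfs.rollSize /
hdfs.rollCount properties named in the thread. The agent, channel, and
sink names, plus the path and size values, are made up for illustration:

    # Roll purely on size; "agent1", "ch1", and "sink1" are hypothetical names.
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.channel = ch1
    agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events
    # Disable time- and count-based rolling so only size triggers a roll.
    agent1.sinks.sink1.hdfs.rollInterval = 0
    agent1.sinks.sink1.hdfs.rollCount = 0
    # Roll once the open file reaches roughly 128 MB (value is in bytes).
    agent1.sinks.sink1.hdfs.rollSize = 134217728

With rollInterval and rollCount zeroed out, only rollSize triggers a
roll, so file sizes stay predictable regardless of event rate.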

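On the append side, here is a rough sketch of what a bare client append
looks like against the HDFS Java API. This is not Flume's sink code,
and the path and payload are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Hypothetical file left behind by an earlier writer.
            Path path = new Path("/flume/events/events.log");

            // append() resumes exactly where the previous writer stopped --
            // including mid-line if the last write ended without a newline,
            // which is the partial-record case Brock describes.
            try (FSDataOutputStream out = fs.append(path)) {
                out.writeBytes("next event\n");
                // hflush() pushes buffered bytes to the datanodes so new
                // readers can see them; hsync() additionally asks for a
                // flush to disk. Neither is an atomic record append.
                out.hflush();
            }
        }
    }

Neither flush call changes the failure mode discussed above: if the
writer dies between writes, the last line can still land without its
trailing newline.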