These are log files being deposited by other processes, which we may not have 
control over.

We don't want multiple processes to write to the same files — we just don't 
want to start our jobs until they have been completely written.

Sorry for lack of clarity & thanks for the response.


--Pete

From: Bertrand Dechoux <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, September 25, 2012 12:33 PM
To: "[email protected]" <[email protected]>
Subject: Re: Detect when file is not being written by another process

Hi,

Multiple files and aggregation or something like hbase?

Could you tell us more about your context? What are the volumes? Why do you 
want multiple processes to write to the same file?

Regards

Bertrand

On Tue, Sep 25, 2012 at 6:28 PM, Peter Sheridan 
<[email protected]> wrote:
Hi all.

We're using Hadoop 1.0.3.  We need to pick up a set of large (4+ GB) files when 
they've finished being written to HDFS by a different process.  There doesn't 
appear to be an API specifically for this.  We discovered through 
experimentation that the FileSystem.append() method can be used for this 
purpose: it will fail if another process is still writing to the file.
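For reference, the probe described above would look roughly like the sketch below. This is a minimal illustration, not code from the thread; the class and method names are invented, and note that the thread itself goes on to report that this very pattern corrupted files on a multi-node cluster, so it is shown only to make the technique concrete.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Illustrative sketch of the append() probe: attempt to open the file
 * for append, and treat failure as "another writer still holds the lease".
 */
public class WriteProbe {

    /** Returns true if no other client currently holds a write lease on the file. */
    public static boolean isFinishedWriting(FileSystem fs, Path path) {
        FSDataOutputStream out = null;
        try {
            // append() throws (typically an AlreadyBeingCreatedException
            // wrapped in a RemoteException) while the file is still open
            // for writing by another client.
            out = fs.append(path);
            return true;
        } catch (IOException stillBeingWritten) {
            return false;
        } finally {
            if (out != null) {
                try {
                    out.close(); // release our own lease immediately
                } catch (IOException ignored) { }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path(args[0]);
        System.out.println(path + " finished: " + isFinishedWriting(fs, path));
    }
}
```

A more common convention for this hand-off problem is for the writer to produce the file under a temporary name (or in a staging directory) and rename it into place once complete, since HDFS renames are atomic; but as Pete notes above, that requires control over the writing process.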

However: when running this on a multi-node cluster, using that API actually 
corrupts the file.  Perhaps this is a known issue?  Looking at the bug tracker 
I see https://issues.apache.org/jira/browse/HDFS-265 and a bunch of 
similar-sounding things.

What's the right way to solve this problem?  Thanks.


--Pete




--
Bertrand Dechoux
