These are log files being deposited by other processes, which we may not have control over.
We don't want multiple processes to write to the same files; we just don't want to start our jobs until the files have been completely written. Sorry for the lack of clarity, and thanks for the response.

--Pete

From: Bertrand Dechoux <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, September 25, 2012 12:33 PM
To: "[email protected]" <[email protected]>
Subject: Re: Detect when file is not being written by another process

Hi,

Multiple files and aggregation, or something like HBase? Could you tell us more about your context? What are the volumes? Why do you want multiple processes to write to the same file?

Regards

Bertrand

On Tue, Sep 25, 2012 at 6:28 PM, Peter Sheridan <[email protected]> wrote:

Hi all.

We're using Hadoop 1.0.3. We need to pick up a set of large (4+ GB) files when they've finished being written to HDFS by a different process. There doesn't appear to be an API specifically for this.

We discovered through experimentation that the FileSystem.append() method can be used for this purpose: it will fail if another process is writing to the file. However, when running this on a multi-node cluster, using that API actually corrupts the file. Perhaps this is a known issue? Looking at the bug tracker, I see https://issues.apache.org/jira/browse/HDFS-265 and a bunch of similar-sounding things.

What's the right way to solve this problem? Thanks.

--Pete

--
Bertrand Dechoux
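[Editor's note for archive readers: since the append() probe described above corrupts files on multi-node clusters, a common workaround when the writing process cannot be modified is to poll the file's reported length and treat it as complete once it stops growing. The sketch below is a minimal illustration against the Hadoop 1.x FileSystem API; the class name StableLengthCheck and the 30-second polling interval are invented for the example, and the approach is a heuristic, not an authoritative answer from this thread.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: decide whether an HDFS file looks fully written by
// checking that its length and modification time are stable across a pause.
// Class name and polling interval are illustrative, not from the thread.
public class StableLengthCheck {

    // Returns true when the file's metadata did not change during the wait.
    // A still-active writer will normally keep growing the visible length.
    public static boolean looksComplete(FileSystem fs, Path path, long waitMillis)
            throws Exception {
        FileStatus before = fs.getFileStatus(path);
        Thread.sleep(waitMillis);
        FileStatus after = fs.getFileStatus(path);
        return before.getLen() == after.getLen()
                && before.getModificationTime() == after.getModificationTime();
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path candidate = new Path(args[0]);
        if (looksComplete(fs, candidate, 30000L)) { // 30s is an arbitrary choice
            System.out.println(candidate + " appears fully written; safe to process.");
        } else {
            System.out.println(candidate + " still appears to be growing; retry later.");
        }
    }
}

One caveat: on HDFS 1.x the length reported for a file under construction may lag behind the data actually written, because the block currently being written is not yet counted, so a conservative interval is advisable. When the writer can be changed, the usual robust fix is to write to a temporary name and rename to the final name only on completion, so readers pick up files purely by name.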
