AFAIK there is no way to determine if a file has been fully written or not.
Oozie uses a feature of Hadoop which writes a _SUCCESS flag file in the output directory of a job. This _SUCCESS file is written at job completion time, thus ensuring all the output of the job is ready. This means that when Oozie is configured to look for a directory FOO/, in practice it looks for the existence of the FOO/_SUCCESS file. You can configure Oozie to look for the existence of FOO/ itself, but then you'll have to use a temp dir, i.e. FOO_TMP/, while writing the data and do a rename to FOO/ once you have finished writing it.

Thx

On Wed, Sep 26, 2012 at 1:52 AM, Hemanth Yamijala <[email protected]> wrote:
> Agree with Bejoy. The problem you've mentioned sounds like building
> something like a workflow, which is what Oozie is supposed to do.
>
> Thanks
> hemanth
>
> On Wed, Sep 26, 2012 at 12:22 AM, Bejoy Ks <[email protected]> wrote:
>>
>> Hi Peter
>>
>> AFAIK oozie has a mechanism to achieve this. You can trigger your jobs as
>> soon as the files are written to a certain hdfs directory.
>>
>> On Tue, Sep 25, 2012 at 10:23 PM, Peter Sheridan
>> <[email protected]> wrote:
>>>
>>> These are log files being deposited by other processes, which we may not
>>> have control over.
>>>
>>> We don't want multiple processes to write to the same files — we just
>>> don't want to start our jobs until they have been completely written.
>>>
>>> Sorry for the lack of clarity & thanks for the response.
>>>
>>> --Pete
>>>
>>> From: Bertrand Dechoux <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Tuesday, September 25, 2012 12:33 PM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: Detect when file is not being written by another process
>>>
>>> Hi,
>>>
>>> Multiple files and aggregation, or something like HBase?
>>>
>>> Could you tell us more about your context? What are the volumes? Why do
>>> you want multiple processes to write to the same file?
>>>
>>> Regards
>>>
>>> Bertrand
>>>
>>> On Tue, Sep 25, 2012 at 6:28 PM, Peter Sheridan
>>> <[email protected]> wrote:
>>>>
>>>> Hi all.
>>>>
>>>> We're using Hadoop 1.0.3. We need to pick up a set of large (4+ GB)
>>>> files when they've finished being written to HDFS by a different process.
>>>> There doesn't appear to be an API specifically for this. We had discovered
>>>> through experimentation that the FileSystem.append() method can be used for
>>>> this purpose — it will fail if another process is writing to the file.
>>>>
>>>> However: when running this on a multi-node cluster, using that API
>>>> actually corrupts the file. Perhaps this is a known issue? Looking at the
>>>> bug tracker I see https://issues.apache.org/jira/browse/HDFS-265 and a bunch
>>>> of similar-sounding things.
>>>>
>>>> What's the right way to solve this problem? Thanks.
>>>>
>>>> --Pete
>>>
>>>
>>> --
>>> Bertrand Dechoux
>>
>>
>

--
Alejandro
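The temp-dir-and-rename approach Alejandro describes can be sketched as follows. This is a minimal local-filesystem analogue using java.nio.file, not the HDFS API itself; the class name, file names, and the "_TMP" suffix are illustrative, and real cluster code would go through org.apache.hadoop.fs.FileSystem's create() and rename() instead:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class StagedWrite {
    // Writes output under FOO_TMP/ and renames it to FOO/ only once the
    // data (and a _SUCCESS marker) are fully written, so a reader polling
    // for FOO/ never observes a partially written directory.
    public static void publish(Path finalDir, byte[] data) throws IOException {
        Path tmpDir = finalDir.resolveSibling(finalDir.getFileName() + "_TMP");
        Files.createDirectories(tmpDir);
        Files.write(tmpDir.resolve("part-00000"), data);
        // Completion marker, mirroring what Hadoop's output committer writes.
        Files.createFile(tmpDir.resolve("_SUCCESS"));
        // A single rename makes the whole output visible at once.
        Files.move(tmpDir, finalDir);
    }
}
```

The key property is that the rename is the last step: everything before it happens under a name no consumer is watching.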
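On the consumer side, the _SUCCESS convention reduces readiness detection to an existence check. A minimal local-filesystem sketch (the class and method names are illustrative; against HDFS you would call FileSystem.exists() with new org.apache.hadoop.fs.Path(dir, "_SUCCESS")):

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadinessCheck {
    // Treats a directory as fully written only when the _SUCCESS marker
    // is present, which is the convention Oozie relies on as described
    // in this thread.
    public static boolean isReady(Path dir) {
        return Files.isDirectory(dir) && Files.exists(dir.resolve("_SUCCESS"));
    }
}
```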
