Have you looked at the files' timestamps? HDFS maintains both an mtime and an atime for each file (the latter isn't exposed in -ls, though).
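As a rough illustration of the timestamp idea, here is a minimal sketch that filters captured `hdfs dfs -ls` output by modification time. It assumes the usual eight-column listing (permissions, replication, owner, group, size, date, time, path). The helper names `parse_ls_line` and `new_files` and the sample paths are made up for this example.

```python
from datetime import datetime

# Hypothetical sample of `hdfs dfs -ls /data/incoming` output (paths are made up).
SAMPLE_LS = """Found 3 items
-rw-r--r--   3 etl hadoop    1048576 2015-03-24 18:02 /data/incoming/a.csv
-rw-r--r--   3 etl hadoop    2097152 2015-03-25 09:40 /data/incoming/b.csv
drwxr-xr-x   - etl hadoop          0 2015-03-25 09:45 /data/incoming/tmp
"""


def parse_ls_line(line):
    """Parse one listing line into (path, mtime); return None for non-file lines."""
    # Assumed columns: permissions, replication, owner, group, size, date, time, path
    parts = line.split(None, 7)
    if len(parts) != 8 or parts[0].startswith("d"):
        return None  # skips the "Found N items" header and directories
    mtime = datetime.strptime(parts[5] + " " + parts[6], "%Y-%m-%d %H:%M")
    return parts[7], mtime


def new_files(ls_output, last_run):
    """Return paths of files whose listed mtime is strictly after last_run."""
    result = []
    for line in ls_output.splitlines():
        parsed = parse_ls_line(line)
        if parsed is not None and parsed[1] > last_run:
            result.append(parsed[0])
    return result


# Files modified since midnight on 25 March 2015.
print(new_files(SAMPLE_LS, datetime(2015, 3, 25)))  # ['/data/incoming/b.csv']
```

Note that -ls prints times to the minute only, so a file written in the same minute as the previous run can be missed by a strict comparison. Keeping a small overlap window, or reading the exact mtime through the FileSystem API, would avoid that.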
In an ETL context, a simple workflow system also solves this. You keep an
incoming directory, a done directory, and a destination directory, and each
job moves files between them before and after processing, so new content is
easy to identify and nothing gets processed twice (as one simple example).

On Wed, Mar 25, 2015 at 11:11 PM, Mich Talebzadeh wrote:

> Hi,
>
> Have you considered taking a snapshot of the files at close of business,
> comparing it with the new snapshot, and processing only the new ones?
> Just a simple shell script will do.
>
> HTH
> ------------------------------
> *From:* Vijaya Narayana Reddy Bhoomi Reddy
> *Date:* Wed, 25 Mar 2015 09:55:57 +0000
> *Subject:* Identifying new files on HDFS
>
> Hi,
>
> We have a requirement to process only new files in HDFS on a daily basis.
> I am sure this is a general requirement in many ETL kinds of processing
> scenarios. Just wondering if there is a way to identify new files that
> are added to a path in HDFS? For example, assume some files were already
> present for some time. Now I have added new files today, so I want to
> process only those new files. What is the best way to achieve this?
>
> Thanks & Regards
> Vijay

--
Harsh J
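
For reference, the incoming/done/destination workflow from the reply above could be sketched roughly as follows. This is only an illustration: the local filesystem stands in for HDFS (on a cluster the move would be an `hdfs dfs -mv` or a FileSystem rename), and `run_job`, its parameters, and the `process` callback are hypothetical names.

```python
import shutil
from pathlib import Path


def run_job(incoming, destination, done, process):
    """Process every file in `incoming`, then move it to `done`.

    `process(src, dst)` is a caller-supplied function that writes its
    output for `src` to `dst` inside the destination directory.
    Returns the names of the files handled in this run.
    """
    incoming, destination, done = Path(incoming), Path(destination), Path(done)
    destination.mkdir(parents=True, exist_ok=True)
    done.mkdir(parents=True, exist_ok=True)

    handled = []
    # sorted() materializes the listing first, so moving files is safe.
    for src in sorted(p for p in incoming.iterdir() if p.is_file()):
        process(src, destination / src.name)
        # Move only after processing succeeds; a failed file stays in
        # `incoming` and is picked up again by the next run.
        shutil.move(str(src), str(done / src.name))
        handled.append(src.name)
    return handled
```

Because a processed file leaves the incoming directory, a second run over the same directory sees only files that arrived in the meantime, so no timestamp bookkeeping is needed.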
