Yes, the state of what files have been processed needs to be tracked outside of the script somehow. Two other approaches come to mind as well:
- Use the HDFS file system as a work queue. For example, move files from /incoming to /processed after processing them.
- Put files in a time-partitioned directory and run your jobs for explicit time intervals. This approach is more common.

On Wed, Mar 27, 2013 at 7:30 AM, John Farrelly <[email protected]> wrote:

> Thanks Mike. That's what I was thinking, but I was wondering if (hoping!)
> there was something already to do it :)
>
> Thanks,
> John.
>
> -----Original Message-----
> From: Mike Sukmanowsky [mailto:[email protected]]
> Sent: 27 March 2013 14:05
> To: [email protected]
> Subject: Re: Don't process already processed files?
>
> It's probably less work to have some kind of a script control Pig
> execution and keep track of what's been processed and pass in an input path
> to your Pig script dynamically. For example, you could create a
> control.py/rb/sh file which would somehow keep track of what's been
> processed (maybe a simple file) and then figure out the input path to pass
> to pig during execution via a parameter: pig --param
> inputpath="/some/dynamic/input/path/for/pig".
>
> You'd then set up your cron job to run your control script instead of the
> Pig script directly.
>
>
> On Wed, Mar 27, 2013 at 6:24 AM, John Farrelly <
> [email protected]> wrote:
>
> > Hi there,
> >
> > In our system, we have multiple pig scripts that run against a
> > particular HDFS directory. The pig scripts can run at different
> > times, and are scheduled to run regularly. Is there a way to point a
> > pig script at the same directory for multiple executions, but make
> > sure that it only processes new files that it hasn't seen before? I
> > was thinking of using a custom PathFilter for my loader, but I thought
> > I would ask to see if there is already a way to do this, rather than me
> > reinventing the wheel (!).
> >
> > Thanks,
> > John.
> >
> > ****************************************************************************************
> > This email and any files transmitted with are confidential and intended solely for the
> > use of the individual or entity to whom they are addressed. If you have received this
> > email in error then please delete it and notify the sender. Do not make a copy or forward
> > it to anyone. This footnote also confirms that this email message has been swept for the
> > presence of computer viruses.
> >
> > Adaptive Mobile Security Ltd, Ferry House, 48 Lower Mount Street, Dublin 2, Ireland
> > Directors: B. Collins, G. Maclachlan (UK), N. Grierson (UK), J. Ennis (UK), D. Summers (UK).
> > Registered in Ireland, Company No. 370343, VAT Reg.No.IE6390343O
> > ****************************************************************************************
>
> --
> Mike Sukmanowsky
>
> Product Lead, http://parse.ly
> 989 Avenue of the Americas, 3rd Floor
> New York, NY 10018
> p: +1 (416) 953-4248
> e: [email protected]

--
*Note that I'm no longer using my Yahoo! email address. Please email me at [email protected] going forward.*
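
P.S. For what it's worth, here is a rough sketch of the time-partitioned approach in Python. It is only an illustration under assumptions of mine: the hourly /data/events/YYYY/MM/DD/HH layout, the script name process.pig, and the helper names are all made up, and the only thing taken from the thread is passing an inputpath parameter to pig.

```python
from datetime import datetime, timedelta
import subprocess

# Hypothetical layout (not from the thread): one directory per hour,
# e.g. /data/events/2013/03/27/14
BASE_DIR = "/data/events"


def partition_path(hour_start, base_dir=BASE_DIR):
    """Return the directory holding the files for the hour starting at hour_start."""
    return "{}/{:%Y/%m/%d/%H}".format(base_dir, hour_start)


def previous_hour(now):
    """Return the start of the last complete hour before `now`."""
    return now.replace(minute=0, second=0, microsecond=0) - timedelta(hours=1)


def pig_command(script, hour_start):
    """Build a pig invocation that reads exactly one hourly partition."""
    return ["pig", "-param", "inputpath=" + partition_path(hour_start), script]


def run_once(script="process.pig"):
    """Process the previous complete hour; raises if pig exits non-zero."""
    subprocess.check_call(pig_command(script, previous_hour(datetime.now())))
```

You would call run_once() from an hourly cron job, and the Pig script would load from '$inputpath'. Because each run covers one explicit interval, nothing has to remember which files were already seen, and a failed hour can be re-run by passing that hour to pig_command by hand.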
