Could save some metadata like CRCs of all the jars... and maybe a hash of the subplan associated with each stored intermediate output... But really we should just do Nectar, since it solves all this and more :)
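A minimal sketch of that metadata idea, assuming we fingerprint each stored intermediate with the CRCs of the registered jars plus a hash of the serialized subplan that produced it (the helper names here are hypothetical, not Pig APIs):

```python
import hashlib
import zlib

def jar_crc(path):
    """CRC-32 of a jar file's bytes (hypothetical helper)."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF

def fingerprint(jar_paths, subplan_text):
    """Combine jar CRCs with a hash of the serialized subplan.

    If any jar (e.g. a recompiled UDF) or the subplan changes, the
    fingerprint changes, so the cached intermediate is treated as stale.
    """
    h = hashlib.sha1()
    for path in sorted(jar_paths):
        h.update(("%s:%08x" % (path, jar_crc(path))).encode())
    h.update(subplan_text.encode())
    return h.hexdigest()
```

Sorting the jar paths keeps the fingerprint independent of registration order; folding in the subplan text is what catches script edits upstream of the checkpoint.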
On Jun 15, 2012, at 10:43 PM, Jonathan Coveney <[email protected]> wrote:

> Well, you can do this physically by adding load/store boundaries to your
> code. Thinking out loud, such a thing could be possible...
>
> At any M/R boundary, you store the intermediate in HDFS, and pig is aware
> of this and doesn't automatically delete it (this part in and of itself is
> not trivial -- what manages the garbage collection? perhaps that could be
> part of the configuration of such a feature). Then, when you rerun a job,
> it will look to see if the nodes that it would have saved (since it knows
> this at compile time) don't already actually exist.
>
> There are some tricky caveats here... what if your code changes affect
> intermediate data? You could save the logical plan as well, but what if you
> make a change to a UDF? I am not sure if the benefit of automating this in
> the language compared to developing a workflow similar to yours external to
> pig is worth the complexity.
>
> But it is intriguing, and is a subset of data caching that we have thought
> a lot about here.
>
> 2012/6/15 Russell Jurney <[email protected]>
>
>> In production I use short Pig scripts and schedule them with Azkaban
>> with dependencies setup, so that I can use Azkaban to restart long
>> data pipelines at the point of failure. I edit the failing pig script,
>> usually towards the end of the data pipeline, and restart the Azkaban
>> job. This saves hours and hours of repeated processing.
>>
>> I wish Pig could do this. To resume at its point of failure when
>> re-run from the command line. Is this feasible?
>>
>> Russell Jurney
>> twitter.com/rjurney
>> [email protected]
>> datasyndrome.com
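The resume-at-failure scheme in the quoted thread (store intermediates at M/R boundaries, and on rerun skip any stage whose stored output still exists and still matches its plan) could be sketched roughly like this; everything here is an illustrative assumption, not Pig's actual checkpointing API:

```python
import hashlib
import os

def plan_hash(subplan_text):
    # Hash of the logical subplan feeding a checkpoint; editing the
    # script upstream of a checkpoint invalidates that checkpoint.
    return hashlib.sha1(subplan_text.encode()).hexdigest()

def run_pipeline(stages, checkpoint_dir, run_stage):
    """Resume a pipeline at its point of failure (hypothetical sketch).

    `stages` is an ordered list of (name, subplan_text) pairs, one per
    M/R boundary; `run_stage(name, subplan, out_dir)` executes one stage
    and writes its output under `out_dir`. A stage is skipped when its
    output directory exists and the recorded plan hash still matches.
    """
    for name, subplan in stages:
        out_dir = os.path.join(checkpoint_dir, name)
        marker = os.path.join(out_dir, ".plan_hash")
        h = plan_hash(subplan)
        if os.path.isdir(out_dir) and os.path.exists(marker) \
                and open(marker).read() == h:
            continue  # valid intermediate already stored; skip recompute
        os.makedirs(out_dir, exist_ok=True)
        run_stage(name, subplan, out_dir)
        with open(marker, "w") as f:
            f.write(h)  # record fingerprint only after a successful run
```

Because the marker is written only after the stage succeeds, a failed run leaves no marker and the stage reruns next time, which is the same restart-from-failure behavior the Azkaban setup provides externally. Garbage collection of old checkpoint directories is left open here, just as it is in the thread.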
