What's Nectar? I'd like this feature because Pig is easier to read than Oozie XML or Azkaban YAML/JSON, where one must manually specify dependencies. Is Lipstick a good example of using Pig this way?

Russell Jurney
twitter.com/rjurney
[email protected]
datasyndrome.com

On Jun 16, 2012, at 8:27 AM, Dmitriy Ryaboy <[email protected]> wrote:

> Could save some metadata like crcs of all the jars... And maybe a hash of the
> subplan associated with each stored intermediate output... But really we
> should just do Nectar since it solves all this and more :)
>
> On Jun 15, 2012, at 10:43 PM, Jonathan Coveney <[email protected]> wrote:
>
>> Well, you can do this physically by adding load/store boundaries to your
>> code. Thinking out loud, such a thing could be possible...
>>
>> At any M/R boundary, you store the intermediate in HDFS, and pig is aware
>> of this and doesn't automatically delete it (this part in and of itself is
>> not trivial -- what manages the garbage collection? perhaps that could be
>> part of the configuration of such a feature). Then, when you rerun a job,
>> it will look to see if the nodes that it would have saved (since it knows
>> this at compile time) don't already actually exist.
>>
>> There are some tricky caveats here... what if your code changes affect
>> intermediate data? You could save the logical plan as well, but what if you
>> make a change to a UDF? I am not sure if the benefit of automating this in
>> the language compared to developing a workflow similar to yours external to
>> pig is worth the complexity.
>>
>> But it is intriguing, and is a subset of data caching that we have thought
>> a lot about here.
>>
>> 2012/6/15 Russell Jurney <[email protected]>
>>
>>> In production I use short Pig scripts and schedule them with Azkaban
>>> with dependencies setup, so that I can use Azkaban to restart long
>>> data pipelines at the point of failure. I edit the failing pig script,
>>> usually towards the end of the data pipeline, and restart the Azkaban
>>> job. This saves hours and hours of repeated processing.
>>>
>>> I wish Pig could do this. To resume at its point of failure when
>>> re-run from the command line. Is this feasible?
>>>
>>> Russell Jurney
>>> twitter.com/rjurney
>>> [email protected]
>>> datasyndrome.com
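
The caching idea discussed in this thread (keep the intermediate output at each M/R boundary, and key it on a hash of the subplan plus CRCs of the jars so that a UDF change invalidates it) can be sketched roughly as below. This is an illustration in Python, not Pig code and not how Pig or Nectar works internally. The function names (`jar_fingerprint`, `stage_key`, `plan_stages`), the use of a local directory in place of HDFS, and the `_SUCCESS` marker as the "output is complete" signal are all assumptions made for the example. Each key also folds in the key of the stage before it, so editing an early stage forces every later stage to rerun.

```python
import hashlib
import zlib
from pathlib import Path


def jar_fingerprint(jar_paths):
    """Return a string built from the CRC32 of every jar's contents."""
    crcs = []
    for path in jar_paths:
        crc = 0
        with open(path, "rb") as f:
            # Read in 1 MiB chunks so large jars are not loaded whole.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                crc = zlib.crc32(chunk, crc)
        crcs.append(f"{crc & 0xFFFFFFFF:08x}")
    # Sort so the fingerprint does not depend on jar ordering.
    return ",".join(sorted(crcs))


def stage_key(parent_key, subplan_text, jars):
    """Hash the upstream key, this stage's subplan, and the jar fingerprint."""
    h = hashlib.sha256()
    for part in (parent_key, subplan_text, jars):
        h.update(part.encode("utf-8"))
        h.update(b"\0")  # separator so adjacent parts cannot run together
    return h.hexdigest()


def plan_stages(stages, jar_paths, cache_dir):
    """Decide, per stage, whether to reuse a stored intermediate or run it.

    stages is an ordered list of (name, subplan_text) pairs, one per
    M/R boundary. Returns a list of (name, action, output_path) where
    action is "reuse" or "run".
    """
    cache_dir = Path(cache_dir)
    jars = jar_fingerprint(jar_paths)
    parent = ""
    decisions = []
    for name, subplan in stages:
        key = stage_key(parent, subplan, jars)
        out = cache_dir / key
        # Assumption: a completed stage leaves a _SUCCESS marker behind.
        action = "reuse" if (out / "_SUCCESS").exists() else "run"
        decisions.append((name, action, out))
        parent = key
    return decisions
```

A rerun after a failure late in the pipeline would then report "reuse" for every stage whose subplan, upstream stages, and jars are unchanged, and "run" from the first edited stage onward. Garbage collection of old intermediates, which Jonathan flags as non-trivial, is not addressed here.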
