Hi,

Ideally, first test the script on a small amount of data in local mode to verify its correctness, and only then run it on the full dataset. If that is not possible, you can store a relation at any point in your script with a STORE statement so that you do not lose intermediate results. Once debugging is done, you can remove the extra STORE statements.
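A rough sketch of what I mean, in case it helps. The script name, paths, aliases and schema below are all made up for illustration; only LOAD, FILTER, GROUP, FOREACH, STORE, PigStorage, the -param substitution and the "-x local" switch are standard Pig.

```pig
-- debug.pig (hypothetical): every path, alias and field here is an assumption.
-- Run on a small sample in local mode first:
--   pig -x local -param input=sample/events.tsv debug.pig

events = LOAD '$input' USING PigStorage('\t')
         AS (user_id:chararray, action:chararray, ts:long);
clicks = FILTER events BY action == 'click';

-- Checkpoint: write the intermediate relation out so it is not lost.
-- Remove this STORE once the steps above are debugged.
STORE clicks INTO 'tmp/debug/clicks' USING PigStorage('\t');

per_user = GROUP clicks BY user_id;
counts   = FOREACH per_user GENERATE group AS user_id, COUNT(clicks) AS n;
STORE counts INTO 'out/click_counts' USING PigStorage('\t');
```

While you are still debugging the later steps, you could replace everything above the checkpoint with a LOAD of 'tmp/debug/clicks' using the same AS (...) schema, so the earlier steps are not re-run each time.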
Best Regards,
Ruslan

On Fri, Oct 19, 2012 at 1:18 PM, Jagat Singh <[email protected]> wrote:
> Hello ,
>
> I understand the pain :)
>
> Have you seen PigUnit and Penny
>
> http://pig.apache.org/docs/r0.10.0/test.html
>
>
> On Fri, Oct 19, 2012 at 8:09 PM, Yang <[email protected]> wrote:
>
>> one of the greatest pains I face with debugging a pig code is that the
>> iteration cycles are really long:
>> the applications for which we use pig typically deal with large dataset,
>> and if a pig script involves many
>> JOIN/generate/filter steps, every step takes a lot of time, but every time
>> I fix one step, I have to run from the start,
>> which is meaningless.
>>
>> what I am doing so far to reduce the meaningless wasted time to re-run
>> already-debugged steps, is to
>> manually divide my script into many small scripts, and save the last
>> variable out into hdfs, and once the
>> small script is debugged fine, I load the previous variable in the next
>> small script
>>
>> after all small scripts are done, I connect them back manually to the
>> original big script.
>>
>>
>> is there a way to automate this? for example add a mark around a particular
>> step, and tells pig
>> that the result is to be saved up, and all following steps are not to be
>> executed. and when we move
>> onto the next step, it knows where to pick up the last-saved data.
>>
>> writing a preprocessor to do the above is not trivial so that I can't whip
>> up something immediately , cuz it needs to figure out the
>> schemas of variables that propagate through the steps.
>>
>>
>> Thanks
>> Yang
>>
