As for:

> the best scenario is to put a "marker" so that certain variables are stored or
> skipped computation but instead LOADed

I remember there was some discussion of this in the past. It is actually not trivial. For example, what should Pig do if you changed a UDF's internal code? How would it know that it should reprocess instead of load? As far as I remember, some other problems were mentioned as well.
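
To make the invalidation question concrete, here is a minimal sketch of one possible approach, not anything Pig does today: fingerprint everything a checkpoint depends on, and LOAD the saved data only when the fingerprint is unchanged. The names `checkpoint_key` and `should_reload` are hypothetical, and the sketch assumes the UDF jar bytes are available to hash.

```python
import hashlib
from typing import List, Mapping, Optional


def _feed(h, data: bytes) -> None:
    # Length-prefix every field so that adjacent fields cannot run together.
    h.update(len(data).to_bytes(8, "big"))
    h.update(data)


def checkpoint_key(script_prefix: str,
                   udf_jars: Mapping[str, bytes],
                   input_paths: List[str]) -> str:
    """Fingerprint everything a checkpoint depends on (hypothetical helper).

    script_prefix: the script text up to and including the checkpointed step.
    udf_jars:      jar name -> jar contents, so a UDF code change alters the key.
    input_paths:   the input locations read by that part of the script.
    """
    h = hashlib.sha256()
    _feed(h, script_prefix.encode("utf-8"))
    for name in sorted(udf_jars):
        _feed(h, name.encode("utf-8"))
        _feed(h, hashlib.sha256(udf_jars[name]).digest())
    for path in sorted(input_paths):
        _feed(h, path.encode("utf-8"))
    return h.hexdigest()


def should_reload(stored_key: Optional[str], current_key: str) -> bool:
    # LOAD the saved result only if a key was recorded and nothing has changed.
    return stored_key is not None and stored_key == current_key
```

Even this does not cover everything: hashing input paths says nothing about whether the data under those paths changed, which is probably one of the "other problems".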
Ruslan

On Fri, Oct 19, 2012 at 11:01 PM, Yang <[email protected]> wrote:

> I am using PigUnit, but it's somewhat limited: it can run only localmode,
> so I can't find issues that come with fairly large test data; you have to
> create small snippets of code that you cut out manually from your original
> code, so after you tested a snippet to be fine, you have to copy-paste that
> back into the production code, which introduces possible copy-paste errors.
> if you compare this to java junit, this is really very crude: in java, you
> have a class, and you can do junit testing on individual methods of the
> class, instead of having to copy paste and create a special "test version"
> of that class.
>
> overall, I feel that testability is an area where PIG could spend a lot
> more efforts and it will greatly benefit its wider adoption. ----- some
> other tools (Cascading, Cascalog etc) advertise testability as one of their
> important features.
>
> let me check out penny... thanks
>
> On Fri, Oct 19, 2012 at 2:18 AM, Jagat Singh <[email protected]> wrote:
>
>> Hello,
>>
>> I understand the pain :)
>>
>> Have you seen PigUnit and Penny
>>
>> http://pig.apache.org/docs/r0.10.0/test.html
>>
>> On Fri, Oct 19, 2012 at 8:09 PM, Yang <[email protected]> wrote:
>>
>> > one of the greatest pains I face with debugging a pig code is that the
>> > iteration cycles are really long:
>> > the applications for which we use pig typically deal with large dataset,
>> > and if a pig script involves many
>> > JOIN/generate/filter steps, every step takes a lot of time, but every time
>> > I fix one step, I have to run from the start,
>> > which is meaningless.
>> >
>> > what I am doing so far to reduce the meaningless wasted time to re-run
>> > already-debugged steps, is to
>> > manually divide my script into many small scripts, and save the last
>> > variable out into hdfs, and once the
>> > small script is debugged fine, I load the previous variable in the next
>> > small script
>> >
>> > after all small scripts are done, I connect them back manually to the
>> > original big script.
>> >
>> > is there a way to automate this? for example add a mark around a particular
>> > step, and tells pig
>> > that the result is to be saved up, and all following steps are not to be
>> > executed. and when we move
>> > onto the next step, it knows where to pick up the last-saved data.
>> >
>> > writing a preprocessor to do the above is not trivial so that I can't whip
>> > up something immediately , cuz it needs to figure out the
>> > schemas of variables that propagate through the steps.
>> >
>> > Thanks
>> > Yang
>> >
>>
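
For reference, the manual split-and-reload workflow quoted above could be roughly automated along these lines. This is only a sketch under assumptions: the `-- @checkpoint <alias>` marker comment and the `split_at_checkpoints` function are made up for illustration, and it assumes that storing with PigStorage's `-schema` option lets the next stage recover the schema on LOAD, which sidesteps the schema-inference problem rather than solving it.

```python
import re
from typing import List

# Hypothetical marker: a Pig comment line such as "-- @checkpoint B".
MARKER = re.compile(r"^\s*--\s*@checkpoint\s+(\w+)\s*$")


def split_at_checkpoints(script: str, checkpoint_dir: str) -> List[str]:
    """Split a Pig script into stages at each marker line.

    Every stage but the last ends by STOREing the marked alias, and the
    following stage begins by LOADing it back from the same path.
    """
    stages = []
    current = []
    for line in script.splitlines():
        match = MARKER.match(line)
        if match is None:
            current.append(line)
            continue
        alias = match.group(1)
        path = f"{checkpoint_dir}/{alias}"
        # Assumption: '-schema' writes the schema next to the data on STORE.
        current.append(
            f"STORE {alias} INTO '{path}' USING PigStorage('\\t', '-schema');")
        stages.append("\n".join(current))
        # The next stage starts from the saved relation instead of recomputing it.
        current = [
            f"{alias} = LOAD '{path}' USING PigStorage('\\t', '-schema');"]
    stages.append("\n".join(current))
    return stages
```

The sketch carries only the marked alias across a stage boundary, so a later stage that refers to any other earlier alias (for example, a JOIN against a relation defined before the marker) would break. It also does nothing about deciding when a saved result is stale.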
