I am using PigUnit, but it is somewhat limited. It can only run in local mode, so I can't find issues that only show up with fairly large test data. You also have to create small snippets of code that you cut out manually from your original script, so after you have tested a snippet, you have to copy-paste it back into the production code, which can introduce copy-paste errors. Compared to JUnit in Java, this is really very crude: in Java you have a class, and you can run JUnit tests on individual methods of that class, instead of having to copy-paste and create a special "test version" of it.
Overall, I feel that testability is an area where Pig could invest a lot more effort, and doing so would greatly benefit its wider adoption.

Some other tools (Cascading, Cascalog, etc.) advertise testability as one of their important features.

Let me check out Penny... thanks

On Fri, Oct 19, 2012 at 2:18 AM, Jagat Singh <[email protected]> wrote:
> Hello,
>
> I understand the pain :)
>
> Have you seen PigUnit and Penny?
>
> http://pig.apache.org/docs/r0.10.0/test.html
>
> On Fri, Oct 19, 2012 at 8:09 PM, Yang <[email protected]> wrote:
>
> > One of the greatest pains I face when debugging Pig code is that the
> > iteration cycles are really long. The applications for which we use Pig
> > typically deal with large datasets, and if a Pig script involves many
> > JOIN/GENERATE/FILTER steps, every step takes a lot of time. But every
> > time I fix one step, I have to run from the start, which is meaningless.
> >
> > What I am doing so far to avoid wasting time re-running already-debugged
> > steps is to manually divide my script into many small scripts and save
> > the last variable out to HDFS. Once a small script is debugged, I load
> > the previous variable in the next small script.
> >
> > After all the small scripts are done, I connect them back manually into
> > the original big script.
> >
> > Is there a way to automate this? For example, add a mark around a
> > particular step that tells Pig that the result is to be saved and that
> > all following steps are not to be executed. Then, when we move on to the
> > next step, it knows where to pick up the last-saved data.
> >
> > Writing a preprocessor to do the above is not trivial, so I can't whip
> > up something immediately, because it needs to figure out the schemas of
> > the variables that propagate through the steps.
> >
> > Thanks
> > Yang
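
For reference, the manual split-and-reload workflow described in the original question could be automated roughly as in the sketch below. This is only an illustration, not an existing Pig feature: the `-- CHECKPOINT alias` marker syntax and the `split_at_checkpoints` function are hypothetical, and the sketch assumes that PigStorage's `-schema` option is available to store the schema next to the data, so the preprocessor does not have to infer schemas itself. It also carries over only the marked alias, so a later stage that needs other earlier aliases would still need manual work.

```python
import re

# Hypothetical marker syntax: a Pig comment line such as "-- CHECKPOINT joined".
CHECKPOINT = re.compile(r"^\s*--\s*CHECKPOINT\s+(\w+)\s*$")


def split_at_checkpoints(script, tmp_dir):
    """Split a Pig script into stage scripts at each checkpoint marker.

    Each stage ends by storing the marked alias, and the following stage
    begins by loading it back. Only the marked alias is carried over.
    """
    stages = []
    current = []
    for line in script.splitlines():
        match = CHECKPOINT.match(line)
        if match is None:
            current.append(line)
            continue
        alias = match.group(1)
        path = f"{tmp_dir}/{alias}"
        # Assumption: PigStorage's '-schema' option keeps the schema with the data.
        storage = "PigStorage('\\t', '-schema')"
        current.append(f"STORE {alias} INTO '{path}' USING {storage};")
        stages.append("\n".join(current))
        # The next stage starts from the saved result instead of recomputing it.
        current = [f"{alias} = LOAD '{path}' USING {storage};"]
    stages.append("\n".join(current))
    return stages


if __name__ == "__main__":
    example = "\n".join([
        "a = LOAD 'input';",
        "b = FILTER a BY $0 > 1;",
        "-- CHECKPOINT b",
        "c = FOREACH b GENERATE $0;",
        "STORE c INTO 'output';",
    ])
    for number, stage in enumerate(split_at_checkpoints(example, "/tmp/ckpt")):
        print(f"-- stage {number}")
        print(stage)
```

Each returned stage can then be run as its own script, so once a stage is debugged only the later stages need to be re-run.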
