ok, I found this practice to be useful:
I divide my code into sections and implement each section as a macro, then I debug each macro separately. At the end of each macro, I manually write its output vars into tmp storage. For each macro I also write a corresponding "***_fake.pig" macro, which has the same signature but populates the same return vars by loading them from the tmp storage. Once I am done with one section, I swap its IMPORT statement to import the ***_fake.pig script instead, so that the same computation is not done again.

On Tue, Oct 23, 2012 at 11:11 AM, Yang <[email protected]> wrote:

> nice, thanks
>
> macros and mock.Storage() are both new to me, I believe it will help a lot
>
> On Mon, Oct 22, 2012 at 5:32 PM, Dmitriy Ryaboy <[email protected]> wrote:
>
>> Some testing tips:
>>
>> 1) parametrize your load/store statements so that if you have to run
>> in hadoop mode, it's easy to switch to debug inputs / outputs (and
>> debug input/output loaders and storers). It's vastly preferable to
>> test in local mode when possible, since the iterations are so much
>> faster.
>>
>> 2) it's a good thing that PigUnit makes you test small pieces of code!
>> Factor out macros so that you can create unit tests; don't copy and
>> paste code, use macros and the import statement.
>>
>> 3) Try using mock.Storage (see
>> https://issues.apache.org/jira/browse/PIG-2650) to automatically
>> create inputs and examine outputs in your unit tests, if you are on
>> pig 11.
>>
>> D
>>
>> On Fri, Oct 19, 2012 at 12:01 PM, Yang <[email protected]> wrote:
>> > I am using PigUnit, but it's somewhat limited: it can run only localmode,
>> > so I can't find issues that come with fairly large test data; you have to
>> > create small snippets of code that you cut out manually from your original
>> > code, so after you tested a snippet to be fine, you have to copy-paste that
>> > back into the production code, which introduces possible copy-paste errors.
>> > if you compare this to java junit, this is really very crude: in java, you
>> > have a class, and you can do junit testing on individual methods of the
>> > class, instead of having to copy paste and create a special "test version"
>> > of that class.
>> >
>> > overall, I feel that testability is an area where PIG could spend a lot
>> > more efforts and it will greatly benefit its wider adoption. ----- some
>> > other tools (Cascading, Cascalog etc) advertise testability as one of their
>> > important features.
>> >
>> > let me check out penny... thanks
>> >
>> > On Fri, Oct 19, 2012 at 2:18 AM, Jagat Singh <[email protected]> wrote:
>> >
>> >> Hello ,
>> >>
>> >> I understand the pain :)
>> >>
>> >> Have you seen PigUnit and Penny
>> >>
>> >> http://pig.apache.org/docs/r0.10.0/test.html
>> >>
>> >> On Fri, Oct 19, 2012 at 8:09 PM, Yang <[email protected]> wrote:
>> >>
>> >> > one of the greatest pains I face with debugging a pig code is that the
>> >> > iteration cycles are really long:
>> >> > the applications for which we use pig typically deal with large dataset,
>> >> > and if a pig script involves many
>> >> > JOIN/generate/filter steps, every step takes a lot of time, but every time
>> >> > I fix one step, I have to run from the start,
>> >> > which is meaningless.
>> >> >
>> >> > what I am doing so far to reduce the meaningless wasted time to re-run
>> >> > already-debugged steps, is to
>> >> > manually divide my script into many small scripts, and save the last
>> >> > variable out into hdfs, and once the
>> >> > small script is debugged fine, I load the previous variable in the next
>> >> > small script
>> >> >
>> >> > after all small scripts are done, I connect them back manually to the
>> >> > original big script.
>> >> >
>> >> > is there a way to automate this? for example add a mark around a particular
>> >> > step, and tells pig
>> >> > that the result is to be saved up, and all following steps are not to be
>> >> > executed. and when we move
>> >> > onto the next step, it knows where to pick up the last-saved data.
>> >> >
>> >> > writing a preprocessor to do the above is not trivial so that I can't whip
>> >> > up something immediately , cuz it needs to figure out the
>> >> > schemas of variables that propagate through the steps.
>> >> >
>> >> > Thanks
>> >> > Yang
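
For anyone finding this thread later, here is a rough, untested sketch of the macro-swapping practice described at the top of this message. Everything specific in it is an assumption made up for illustration: the file names (clicks.pig, clicks_fake.pig, main.pig), the macro name extract_clicks, the field names, and the input, tmp, and output paths. The three files are shown in one block, separated by comments.

```pig
-- clicks.pig: the real macro (hypothetical name, fields, and logic).
-- It computes its result and also checkpoints it to tmp storage.
DEFINE extract_clicks(events, tmp_path) RETURNS clicks {
    $clicks = FILTER $events BY action == 'click';
    STORE $clicks INTO '$tmp_path';
};

-- clicks_fake.pig: same name and signature as the real macro, but the
-- return var is loaded from the checkpoint instead of being recomputed.
-- The $events parameter is deliberately unused here.
DEFINE extract_clicks(events, tmp_path) RETURNS clicks {
    $clicks = LOAD '$tmp_path' AS (user_id:chararray, action:chararray);
};

-- main.pig: only the IMPORT line changes between the two modes.
-- IMPORT 'clicks.pig';        -- use while this section is being debugged
IMPORT 'clicks_fake.pig';      -- use once this section is known to be good

raw_events = LOAD 'input/events' AS (user_id:chararray, action:chararray);
click_rows = extract_clicks(raw_events, 'tmp/clicks');
STORE click_rows INTO 'output/clicks';
```

The schema in the fake macro has to be written by hand to match what the real macro stores, which is the same schema-propagation problem mentioned in the original question. The real macro must also have been run at least once so that the checkpoint exists before switching the IMPORT.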
