On 9 September 2011 01:28, Daniel Dai <[email protected]> wrote:
> Yes, makes sense to change it in Pig anyway. The code is in
> org.apache.pig.parser.LogicalPlanBuilder.buildNativeOp. You may also need to
> change the parser to make Load/Store optional. Would you want to give it a try?
Having slept on this, I'm not so sure now. If we lose the LOAD/STORE, then Pig
knows that relation B needs A, but it doesn't see that relations C and D are
each defined in terms of the (final, complete) result of B. Without this
information, how is Pig's execution engine supposed to plan dependencies
appropriately? Is there not a risk that these logically sequential jobs are
initiated in parallel?

Re Shawn's suggestion to drive everything from Python: I'm open-minded.
Whatever works, really. I've not tried wrapping Pig in Python yet; I've only
seen Python used for UDFs.

Dan

>> > A = xxxxx -- Pig pipeline
>> > B = MAPREDUCE mahout.jar Store A into '<PATH>/content/reuters/reuters-out'
>> >     seqdirectory --input <PATH>/content/reuters/reuters-out --output
>> >     <PATH>/content/reuters/seqfiles --charset UTF-8
>> > C = MAPREDUCE mahout.jar seq2sparse --input <PATH>/content/reuters/seqfiles
>> >     --output <PATH>/content/reuters/seqfiles-TF --norm 2 --weight TF
>> > D = MAPREDUCE mahout.jar Load '<PATH>/content/reuters/seqfiles-TF-IDF'
>> >     seq2sparse --input <PATH>/content/reuters/seqfiles --output
>> >     <PATH>/content/reuters/seqfiles-TF-IDF --norm 2 --weight TFIDF
>> > E = foreach D generate .... -- Pig pipeline
>> >
>> > You only need to interface Pig in the first and last step, but Pig requires
>> > you to do LOAD/STORE for each job, and that's the problem. If we make
>> > Store/Load optional, that will solve your problem, right?
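For what Shawn's "drive everything from Python" approach might look like, here is a minimal sketch: a plain Python driver that runs each job only after everything it depends on has finished, which sidesteps the question of whether Pig's planner sees the B → C/D ordering. All step names, jar names, and paths below are illustrative placeholders, not taken from a real pipeline, and this is just one way to structure such a driver.

```python
import subprocess

# Hypothetical pipeline mirroring the A..E shape from the thread:
# (step name, command to run, names of steps it depends on).
STEPS = [
    ("A", ["pig", "-f", "prepare.pig"], []),
    ("B", ["hadoop", "jar", "mahout.jar", "seqdirectory",
           "--input", "/tmp/reuters-out", "--output", "/tmp/seqfiles"], ["A"]),
    ("C", ["hadoop", "jar", "mahout.jar", "seq2sparse",
           "--input", "/tmp/seqfiles", "--output", "/tmp/seqfiles-TF"], ["B"]),
    ("D", ["hadoop", "jar", "mahout.jar", "seq2sparse",
           "--input", "/tmp/seqfiles", "--output", "/tmp/seqfiles-TFIDF"], ["B"]),
    ("E", ["pig", "-f", "postprocess.pig"], ["C", "D"]),
]

def run_pipeline(steps, run=subprocess.check_call):
    """Run steps strictly in dependency order, one at a time.

    Each command blocks until its job completes, so there is no chance of
    logically sequential jobs being launched in parallel.
    """
    done = set()
    order = []
    pending = list(steps)
    while pending:
        progressed = False
        for step in list(pending):
            name, cmd, deps = step
            if all(d in done for d in deps):
                run(cmd)  # blocks until this job finishes
                done.add(name)
                order.append(name)
                pending.remove(step)
                progressed = True
        if not progressed:
            raise RuntimeError("circular dependency among steps")
    return order
```

The trade-off, of course, is that the driver loses whatever optimisation Pig could do if it understood the whole plan; this just makes the ordering explicit.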
