On 9 September 2011 01:28, Daniel Dai <[email protected]> wrote:
> Yes, makes sense to change it in Pig anyway. The code is in
> org.apache.pig.parser.LogicalPlanBuilder.buildNativeOp. You may also need to
> change the parser to make Load/Store optional. Would you want to give it a try?
Having slept on this, I'm not so sure now. If we lose the LOAD/STORE, then Pig
knows that relation B needs A, but it doesn't see that relations C and D are
each defined in terms of the (final, complete) result of B. Without this
information, how is Pig's execution engine supposed to plan dependencies
appropriately? Is there not a risk that these logically sequential jobs are
initiated in parallel?

Re Shawn's suggestion to drive everything from Python: I'm open-minded.
Whatever works, really. I've not tried wrapping Pig in Python yet; I've only
seen Python used for UDFs.

Dan

>> > A = xxxxx -- Pig pipeline
>> > B = MAPREDUCE mahout.jar Store A into '<PATH>/content/reuters/reuters-out'
>> >     seqdirectory --input <PATH>/content/reuters/reuters-out --output
>> >     <PATH>/content/reuters/seqfiles --charset UTF-8
>> > C = MAPREDUCE mahout.jar seq2sparse --input <PATH>/content/reuters/seqfiles
>> >     --output <PATH>/content/reuters/seqfiles-TF --norm 2 --weight TF
>> > D = MAPREDUCE mahout.jar Load '<PATH>/content/reuters/seqfiles-TF-IDF'
>> >     seq2sparse --input <PATH>/content/reuters/seqfiles --output
>> >     <PATH>/content/reuters/seqfiles-TF-IDF --norm 2 --weight TFIDF
>> > E = foreach D generate .... -- Pig pipeline
>> >
>> > You only need to interface Pig in the first and last step, but Pig requires
>> > you to do LOAD/STORE for each job, and that's the problem. If we make
>> > Store/Load optional, that will solve your problem, right?
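For what Shawn's "drive everything from Python" approach might look like, here is a minimal sketch: a plain Python driver that runs each job only after everything it depends on has finished, which sidesteps the question of whether Pig's planner sees the B → C/D ordering. All step names, jar names, and paths below are illustrative placeholders, not taken from a real pipeline, and this is just one way to structure such a driver.

```python
import subprocess

# Hypothetical pipeline mirroring the A..E shape from the thread:
# (step name, command to run, names of steps it depends on).
STEPS = [
    ("A", ["pig", "-f", "prepare.pig"], []),
    ("B", ["hadoop", "jar", "mahout.jar", "seqdirectory",
           "--input", "/tmp/reuters-out", "--output", "/tmp/seqfiles"], ["A"]),
    ("C", ["hadoop", "jar", "mahout.jar", "seq2sparse",
           "--input", "/tmp/seqfiles", "--output", "/tmp/seqfiles-TF"], ["B"]),
    ("D", ["hadoop", "jar", "mahout.jar", "seq2sparse",
           "--input", "/tmp/seqfiles", "--output", "/tmp/seqfiles-TFIDF"], ["B"]),
    ("E", ["pig", "-f", "postprocess.pig"], ["C", "D"]),
]

def run_pipeline(steps, run=subprocess.check_call):
    """Run steps strictly in dependency order, one at a time.

    Each command blocks until its job completes, so there is no chance of
    logically sequential jobs being launched in parallel.
    """
    done = set()
    order = []
    pending = list(steps)
    while pending:
        progressed = False
        for step in list(pending):
            name, cmd, deps = step
            if all(d in done for d in deps):
                run(cmd)  # blocks until this job finishes
                done.add(name)
                order.append(name)
                pending.remove(step)
                progressed = True
        if not progressed:
            raise RuntimeError("circular dependency among steps")
    return order
```

The trade-off, of course, is that the driver loses whatever optimisation Pig could do if it understood the whole plan; this just makes the ordering explicit.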
