On Fri, Sep 9, 2011 at 12:18 AM, Dan Brickley <[email protected]> wrote:

> On 9 September 2011 01:28, Daniel Dai <[email protected]> wrote:
> > Yes, makes sense to change it in Pig anyway. The code is in
> > org.apache.pig.parser.LogicalPlanBuilder.buildNativeOp. You may also need to
> > change the parser to make Load/Store optional. Would you want to give it a try?
>
> Having slept on this, I'm not so sure now. If we lose the LOAD/STORE
> then Pig knows that relation B needs A; but it doesn't see that
> relation C and relation D are each defined in terms of the (final,
> complete) result of B.
>
> Without this information, how is Pig's execution engine supposed to
> plan dependencies appropriately? Is there not a risk that these
> logically sequential jobs are initiated in parallel?
>

Yes, we also need a way to specify the dependency, like:
B = MAPREDUCE A jar ......
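
To illustrate why the dependency has to be explicit, here is a minimal
sketch in Python (not Pig's actual planner). The dependency map is an
assumption read off the example jobs in this thread: B needs A, C and D
each need the finished output of B, and E needs D. Given that information,
a planner can derive a safe launch order. Without it, nothing stops C and D
from starting before B completes.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map for the jobs discussed in this thread.
# Each key maps to the set of relations whose finished output it needs.
deps = {
    "B": {"A"},
    "C": {"B"},
    "D": {"B"},
    "E": {"D"},
}


def execution_order(dependencies):
    """Return one valid sequential order in which the jobs could be launched."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(deps)
print(order)
```

C and D do not depend on each other, so either may come first in the
result. Only the orderings A before B, B before C and D, and D before E are
guaranteed.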


> Re Shawn's suggestion to drive everything from Python: I'm open-minded.
> Whatever works, really. I've not tried wrapping Pig in Python yet,
> I've only seen it used for UDFs.
>

This should be a good approach. With Python you get more flexibility, and
it's easy to embed a Pig script in it.
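
A minimal sketch of that approach, assuming plain Python that shells out to
the `pig` and `mahout` command-line launchers rather than Pig's embedded
scripting API. The script names `first_step.pig` and `last_step.pig` and
the helper `run_pipeline` are hypothetical, and the paths are the
placeholders from the example in this thread. Each step starts only after
the previous one has finished, so the ordering is enforced by the driver.

```python
import subprocess


def run_pipeline(steps, runner=subprocess.check_call):
    """Run each (name, command) step in order and return the completed names.

    An exception raised by the runner (check_call raises on a non-zero
    exit status) propagates, so the remaining steps are never started.
    """
    completed = []
    for name, command in steps:
        runner(command)
        completed.append(name)
    return completed


# Hypothetical commands; <PATH> is left as a placeholder on purpose.
BASE = "<PATH>/content/reuters"
STEPS = [
    ("A", ["pig", "-f", "first_step.pig"]),  # assumed to store into reuters-out
    ("B", ["mahout", "seqdirectory", "--input", BASE + "/reuters-out",
           "--output", BASE + "/seqfiles", "--charset", "UTF-8"]),
    ("C", ["mahout", "seq2sparse", "--input", BASE + "/seqfiles",
           "--output", BASE + "/seqfiles-TF", "--norm", "2", "--weight", "TF"]),
    ("D", ["mahout", "seq2sparse", "--input", BASE + "/seqfiles",
           "--output", BASE + "/seqfiles-TF-IDF", "--norm", "2",
           "--weight", "TFIDF"]),
    ("E", ["pig", "-f", "last_step.pig"]),  # assumed to load seqfiles-TF-IDF
]

# To execute for real (requires pig and mahout on the PATH):
# run_pipeline(STEPS)
```

The `runner` parameter is there only so the ordering logic can be exercised
without a cluster, by passing a function that records the commands instead
of running them.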


>
> Dan
>
> >> > A = xxxxx -- Pig pipeline
> >> > B = MAPREDUCE mahout.jar Store A into '<PATH>/content/reuters/reuters-out'
> >> >     seqdirectory --input <PATH>/content/reuters/reuters-out --output
> >> >     <PATH>/content/reuters/seqfiles --charset UTF-8
> >> > C = MAPREDUCE mahout.jar seq2sparse --input <PATH>/content/reuters/seqfiles
> >> >     --output <PATH>/content/reuters/seqfiles-TF --norm 2 --weight TF
> >> > D = MAPREDUCE mahout.jar Load '<PATH>/content/reuters/seqfiles-TF-IDF'
> >> >     seq2sparse --input <PATH>/content/reuters/seqfiles --output
> >> >     <PATH>/content/reuters/seqfiles-TF-IDF --norm 2 --weight TFIDF
> >> > E = foreach D generate ....   -- Pig pipeline
> >> >
> >> > You only need to interface with Pig in the first and last step, but Pig
> >> > requires you to do LOAD/STORE for each job, and that's the problem. If we
> >> > make Store/Load optional, that will solve your problem, right?
>
