On 8 Sep 2011, at 23:36, Daniel Dai <[email protected]> wrote:

> It seems like you want to do something like this:
> 
> A = xxxxx -- Pig pipeline
> B = MAPREDUCE mahout.jar Store A into '<PATH>/content/reuters/reuters-out'
> seqdirectory --input <PATH>/content/reuters/reuters-out --output
> <PATH>/content/reuters/seqfiles --charset UTF-8
> C = MAPREDUCE mahout.jar seq2sparse --input <PATH>/content/reuters/seqfiles
> --output <PATH>/content/reuters/seqfiles-TF --norm 2 --weight TF
> D = MAPREDUCE mahout.jar Load '<PATH>/content/reuters/seqfiles-TF-IDF'
> seq2sparse --input <PATH>/content/reuters/seqfiles --output
> <PATH>/content/reuters/seqfiles-TF-IDF --norm 2 --weight TFIDF
> E = foreach D generate ....   -- Pig pipeline
> 
> You only need to interface with Pig in the first and last steps, but Pig
> requires a LOAD/STORE for each job, and that's the problem. If we make
> Store/Load optional, that will solve your problem, right?

I think so. I'd like to confirm that this really works OK before asking for a
change to Pig. But I guess there are other non-Mahout scenarios with similar
needs. Can you suggest where to patch Pig to make store/load optional?
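
For comparison, the first Mahout step written in the form Pig's MAPREDUCE
operator accepts today might look roughly like the sketch below, with both the
STORE and LOAD clauses present. This is untested and only illustrates the
constraint: alias A is assumed to come from the earlier Pig pipeline, the
default loader is a placeholder, and whether mahout.jar can be driven with
these parameters is exactly what still needs confirming.

```pig
-- Sketch only, not a verified script.
-- A is assumed to be defined by the earlier Pig pipeline.
B = MAPREDUCE 'mahout.jar'
      -- STORE hands A to the external job as its input directory
      STORE A INTO '<PATH>/content/reuters/reuters-out'
      -- LOAD is required by the current syntax even though the output is
      -- only consumed by the next Mahout job; the default loader used here
      -- is a placeholder
      LOAD '<PATH>/content/reuters/seqfiles'
      -- parameters passed to the jar, taken from the quoted example above
      `seqdirectory --input <PATH>/content/reuters/reuters-out --output <PATH>/content/reuters/seqfiles --charset UTF-8`;
```

With store/load optional, the LOAD clause here and the STORE/LOAD clauses on
the intermediate seq2sparse step could simply be dropped.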

Dan
