It seems like you want to do something like this:

A = xxxxx  -- Pig pipeline
B = MAPREDUCE mahout.jar Store A into '<PATH>/content/reuters/reuters-out'
        `seqdirectory --input <PATH>/content/reuters/reuters-out --output <PATH>/content/reuters/seqfiles --charset UTF-8`
C = MAPREDUCE mahout.jar
        `seq2sparse --input <PATH>/content/reuters/seqfiles --output <PATH>/content/reuters/seqfiles-TF --norm 2 --weight TF`
D = MAPREDUCE mahout.jar Load '<PATH>/content/reuters/seqfiles-TF-IDF'
        `seq2sparse --input <PATH>/content/reuters/seqfiles --output <PATH>/content/reuters/seqfiles-TF-IDF --norm 2 --weight TFIDF`
E = foreach D generate ....   -- Pig pipeline

You only need to interface with Pig in the first and last steps, but Pig
requires you to do a STORE/LOAD for each job, and that's the problem. If
we make STORE/LOAD optional, will that solve your problem?
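
For comparison, here is roughly what such a chain has to look like with
the current syntax, where every MAPREDUCE step must carry its own
STORE ... INTO and LOAD clause even when the data never needs to pass
through Pig. This is only a sketch (shortened to two Mahout steps): the
jar name, driver arguments and intermediate paths are illustrative, and
in practice you would also need load/store functions that understand
Mahout's SequenceFile formats:

A = LOAD '<PATH>/content/reuters/extracted' AS (doc:chararray);  -- Pig pipeline

-- today each step needs its own STORE/LOAD pair, so C has to re-store
-- B's output just to satisfy the syntax
B = MAPREDUCE 'mahout.jar'
        STORE A INTO '<PATH>/content/reuters/reuters-out'
        LOAD '<PATH>/content/reuters/seqfiles'
        `seqdirectory --input <PATH>/content/reuters/reuters-out --output <PATH>/content/reuters/seqfiles --charset UTF-8`;

C = MAPREDUCE 'mahout.jar'
        STORE B INTO '<PATH>/content/reuters/seqfiles-copy'   -- redundant round trip
        LOAD '<PATH>/content/reuters/seqfiles-TF-IDF'
        `seq2sparse --input <PATH>/content/reuters/seqfiles --output <PATH>/content/reuters/seqfiles-TF-IDF --norm 2 --weight TFIDF`;

E = FOREACH C GENERATE ....  -- back into the Pig pipeline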

Daniel

On Thu, Sep 8, 2011 at 1:22 PM, Dan Brickley <[email protected]> wrote:

> On 8 September 2011 20:29, Daniel Dai <[email protected]> wrote:
> > Thanks Dan, see my comments inline.
> > On Wed, Sep 7, 2011 at 8:08 AM, Dan Brickley <[email protected]> wrote:
> >
> >> Hi all! I have been experimenting with wrapping some of Apache
> >> Mahout's machine-learning-related jobs inside Pig macros, via the
> >> MAPREDUCE keyword. This seemed quite nearly doable, but I hit a few
> >> issues, hence this mail.
> >>
> >
> >> While I enjoyed an initial minor success, I hit a problem because the
> >> job I was trying actually wanted to take input from existing data in
> >> hdfs, rather than from Pig. However it seems Pig requires a 'STORE FOO
> >> INTO' clause when using MAPREDUCE. Is there any reason this is not
> >> optional?
> >>
> >
> > We expect the native mapreduce job to take one input produced by Pig
> > and produce one output that feeds into the rest of the Pig script.
> > This is the interface between Pig and MapReduce.
> > Take WordCount as an example:
> > b = mapreduce 'hadoop-examples.jar' Store a into 'input' Load 'output'
> >         `wordcount input output`;
> >
> > Pig will save a into 'input' and wordcount will take it as its input.
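> >
> > (A fuller form of that statement, with an output schema so the result
> > can be used directly in later Pig steps, might look roughly like this;
> > the AS clause and the follow-on ORDER step are illustrative additions,
> > assuming wordcount writes tab-separated word/count pairs that the
> > default PigStorage loader can read:)
> >
> > b = mapreduce 'hadoop-examples.jar'
> >         Store a into 'input'
> >         Load 'output' AS (word:chararray, count:long)
> >         `wordcount input output`;
> > c = ORDER b BY count DESC;  -- continue in Pig with wordcount's output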
> >
> > In your script, I saw you hard-code the Mahout input/output. I believe
> > this is just a test; in the real world you would use Pig to prepare and
> > consume the input/output. Otherwise, what's the point of binding Pig
> > and Mahout?
>
> Yes, I would expect Pig could take on more of the data preparation and
> filtering tasks. However, Mahout itself offers several different
> components that typically get pipelined together to solve problems. I
> was trying to extend the example by also making a macro for the Mahout
> task 'seqdirectory',
> https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html ...
> I'm not sure if that can be directly 'piggified', but I was expecting
> that Pig could be used to express the data flow, and that a common
> pattern would be for data to start in Pig, pass through perhaps one,
> two or three Mahout-based tasks, and then deliver final output back
> into Pig's world.
>
> Maybe it would help to take some of the concrete examples that show up
> in typical Mahout howtos, and think through how those might be
> expressed in a more Piggy way? For example
>
> http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/
> shows a sequence of Mahout jobs, beginning with fetching a Reuters
> dataset (collection of documents), and then creating sequence files,
> and then from those, creating different flavoured Sparse Vector
> representations via different arguments/parameters, for subsequent
> consumption in LDA and kmeans clustering jobs. Oh, and then the
> results are printed/explored. Is that the kind of data flow that Pig
> could reasonably be expected to manage via 'MAPREDUCE', or am I
> over-stretching the mechanism?
>
> Another example (clustering again), from
>
> http://blog.jteam.nl/2011/04/04/how-to-cluster-seinfeld-episodes-with-mahout/
>
> https://github.com/frankscholten/mahout/blob/seinfeld_demo/examples/bin/seinfeld_vectors.sh
> then
> https://github.com/frankscholten/mahout/blob/seinfeld_demo/examples/bin/seinfeld_kmeans.sh
>
> So again, the flow here from those .sh scripts (I'll trim some params,
> leaving just the in/out pipeline) is:
>
> bin/mahout seqdirectory --input  examples/src/main/resources/seinfeld-scripts-preprocessed \
>                         --output out-seinfeld-seqfiles [...]
> bin/mahout seq2sparse   --input  out-seinfeld-seqfiles \
>                         --output out-seinfeld-vectors [...]
> bin/mahout kmeans       --input    out-seinfeld-vectors/tfidf-vectors \
>                         --output   out-seinfeld-kmeans/clusters \
>                         --clusters out-seinfeld-kmeans/initialclusters [...]
> bin/mahout clusterdump  --seqFileDir     out-seinfeld-kmeans/clusters/clusters-1 \
>                         --pointsDir      out-seinfeld-kmeans/clusters/clusteredPoints \
>                         --numWords       5 \
>                         --dictionary     out-seinfeld-vectors/dictionary.file-0 \
>                         --dictionaryType sequencefile
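>
> For the sake of argument, that chain might be sketched in Pig via the
> MAPREDUCE operator roughly as below. This is only a sketch: the jar
> name, the driver argument style and the intermediate '-copy' paths are
> assumptions on my part, each step still needs a STORE/LOAD pair under
> the current syntax, and Pig's default load/store functions don't
> understand Mahout's SequenceFile formats, so dedicated load/store
> functions would be needed too:
>
> episodes = LOAD 'seinfeld-scripts-preprocessed' AS (doc:chararray);
>
> seqfiles = MAPREDUCE 'mahout-examples-job.jar'
>         STORE episodes INTO 'seinfeld-in'
>         LOAD 'out-seinfeld-seqfiles'
>         `seqdirectory --input seinfeld-in --output out-seinfeld-seqfiles`;
>
> vectors = MAPREDUCE 'mahout-examples-job.jar'
>         STORE seqfiles INTO 'out-seinfeld-seqfiles-copy'  -- redundant round trip
>         LOAD 'out-seinfeld-vectors/tfidf-vectors'
>         `seq2sparse --input out-seinfeld-seqfiles --output out-seinfeld-vectors`;
>
> clusters = MAPREDUCE 'mahout-examples-job.jar'
>         STORE vectors INTO 'out-seinfeld-vectors-copy'  -- redundant round trip
>         LOAD 'out-seinfeld-kmeans/clusters'
>         `kmeans --input out-seinfeld-vectors/tfidf-vectors --output out-seinfeld-kmeans/clusters --clusters out-seinfeld-kmeans/initialclusters`;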
>
> I should say I'm no expert on the Mahout details either, but since a
> lot of my base input data is being handled (and joined, filtered, etc.)
> very nicely by Pig, I'm very curious about having some closer
> integration here. I also have no strong intuition about the impact of
> all this on efficiency, in terms either of parallelism or of the cost
> of storing intermediate results on disk rather than keeping everything
> in Pig data structures.
>
> >> 2011-09-07 17:08:05,528 [main] ERROR org.apache.pig.tools.grunt.Grunt
> >> - ERROR 1200: <line 4> Failed to parse macro 'collocations'. Reason:
> >> <file mig.macro, line 6, column 1>  mismatched input 'LOAD' expecting
> >> STORE
> >>
> >> Complicating things further, I couldn't see a way of creating data for
> >> this dummy input within Pig Latin (or at least the Grunt shell), other
> >> than loading an empty file (which needed creating, cleaning up, etc).
> >> Is there a syntax for declaring relations as literal data inline that
> >> I'm missing? Also experimenting in Grunt I found it tricky that
> >> piggybank.jar couldn't be registered within the macro I 'IMPORT', and
> >> that it was all too easy to get an error from importing the same macro
> >> twice within one session.
> >>
> >
> > This is definitely something we want to fix.
>
> Thanks. Let me know if you need any more detailed report / filing.
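>
> (For what it's worth, one shape that might avoid the piggybank problem
> is to keep REGISTER out of the macro file and do it in the top-level
> script before the IMPORT, importing each macro file only once per
> session. A rough, purely illustrative sketch; the macro body, jar names
> and paths are placeholders:)
>
> -- mahout.macro: only the macro definition, no REGISTER statements
> DEFINE mahout_step(docs) RETURNS result {
>     $result = MAPREDUCE 'mahout.jar'
>             STORE $docs INTO 'mahout-in'
>             LOAD 'mahout-out'
>             `seqdirectory --input mahout-in --output mahout-out`;
> };
>
> -- main script: register jars first, then IMPORT the macro exactly once
> REGISTER piggybank.jar;
> IMPORT 'mahout.macro';
> docs = LOAD 'documents' AS (doc:chararray);
> out  = mahout_step(docs);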
>
> >> The Mahout/Pig proof of concept examples are at
> >>
> >>
> https://raw.github.com/gist/1192831/f9376f0b73533a0a0af4e8d89b6ea3d1871692ff/gistfile1.txt
> >>
> >> Details of the Mahout side of things at
> >>
> >>
> http://mail-archives.apache.org/mod_mbox/mahout-user/201109.mbox/raw/%3CCAFNgM+YaPgFzEJP+sDNXPKfxi0AmAY7_sN79FN2=1cvjl8c...@mail.gmail.com%3E
> >>
> >> If I'm missing something obvious that will provide for smoother
> >> integration, I'd be very happy to learn. [...]
> >> Is this a reasonable thing to attempt? At least in the Mahout case, it
> >> looks to me quite common for input to come from other files in hdfs
> >> rather than from Pig relations, so maybe the requirement for STORE ...
> >> INTO could be softened?
> >>
> >
> >> Thanks for any suggestions...
>
> > That seems to be a very interesting project. Let me know your progress
> > and anything I can help with.
>
> Thanks. I hit a few issues on the Mahout side too, but I'll see how
> far I can get with a simple set of macros, even if I have to use the
> 'IGNORE' hack for now. If you have any suggestions for a cleaner
> syntax/approach that'll work in Pig 0.9, I'd love to hear them.
>
> Whether this will ever be truly useful depends, I think, on the kind of
> pipeline scenarios sketched above, i.e. where more than one consecutive
> step happens outside of Pig. There might be a case for interacting
> with those external programs without having each step's results
> written into hdfs, but I'm not sure how that would best be
> implemented.
>
> cheers,
>
> Dan
>
