It seems like you want to do something like this:

A = xxxxx  -- Pig pipeline
B = MAPREDUCE mahout.jar Store A into '<PATH>/content/reuters/reuters-out'
      seqdirectory --input <PATH>/content/reuters/reuters-out
      --output <PATH>/content/reuters/seqfiles --charset UTF-8
C = MAPREDUCE mahout.jar
      seq2sparse --input <PATH>/content/reuters/seqfiles
      --output <PATH>/content/reuters/seqfiles-TF --norm 2 --weight TF
D = MAPREDUCE mahout.jar Load '<PATH>/content/reuters/seqfiles-TF-IDF'
      seq2sparse --input <PATH>/content/reuters/seqfiles
      --output <PATH>/content/reuters/seqfiles-TF-IDF --norm 2 --weight TFIDF
E = foreach D generate ....  -- Pig pipeline
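For comparison, a rough sketch of the same seqdirectory step written against the current syntax, with the mandatory STORE/LOAD pair spelled out (same paths as above, native command in backquotes):

B = MAPREDUCE 'mahout.jar'
      STORE A INTO '<PATH>/content/reuters/reuters-out'
      LOAD '<PATH>/content/reuters/seqfiles'
      `seqdirectory --input <PATH>/content/reuters/reuters-out --output <PATH>/content/reuters/seqfiles --charset UTF-8`;

Here the LOAD clause just points Pig at whatever seqdirectory wrote so that the next statement can refer to B (in practice you would probably also need a load function that understands Mahout's sequence-file output).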
You only need to interface with Pig in the first and last step, but Pig requires you to do LOAD/STORE for each job, and that's the problem. If we make Store/Load optional, that will solve your problem, right?

Daniel

On Thu, Sep 8, 2011 at 1:22 PM, Dan Brickley <[email protected]> wrote:
> On 8 September 2011 20:29, Daniel Dai <[email protected]> wrote:
> > Thanks Dan, see my comments inline.
> >
> > On Wed, Sep 7, 2011 at 8:08 AM, Dan Brickley <[email protected]> wrote:
> >>
> >> Hi all! I have been experimenting with wrapping some of Apache
> >> Mahout's machine-learning-related jobs inside Pig macros, via the
> >> MAPREDUCE keyword. This seemed quite nearly do-able, but I hit a few
> >> issues, hence this mail.
> >>
> >> While I enjoyed an initial minor success, I hit a problem because the
> >> job I was trying actually wanted to take its input from existing data
> >> in HDFS, rather than from Pig. However, it seems Pig requires a
> >> 'STORE FOO INTO' clause when using MAPREDUCE. Is there any reason this
> >> is not optional?
> >
> > We expect the native mapreduce job to take one input produced by Pig,
> > and to produce one output feeding into the rest of the Pig script. This
> > is the interface between Pig and MapReduce. Take WordCount as an example:
> >
> >   b = mapreduce 'hadoop-examples.jar' Store a into 'input' Load 'output'
> >         `wordcount input output`;
> >
> > Pig will save a into 'input' and wordcount will take it as its input.
> >
> > In your script, I saw you hard-code the Mahout input/output. I believe
> > this is just a test; in the real world you would use Pig to prepare and
> > consume the input/output. Otherwise, what's the point of binding
> > Pig/Mahout?
>
> Yes, I would expect Pig could take on more of the data preparation and
> filtering tasks. However, Mahout itself offers several different
> components that typically get pipelined together to solve problems. I
> was trying to extend the example by also making a macro for the Mahout
> task 'seqdirectory',
> https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html ...
> I'm not sure if that can be directly 'piggified', but I was expecting
> that Pig could be used to express the data flow, and that a common
> pattern would be for data to start in Pig, pass through perhaps one,
> two or three Mahout-based tasks, and then come back into Pig's world as
> final output.
>
> Maybe it would help to take some of the concrete examples that show up
> in typical Mahout howtos, and think through how those might be
> expressed in a more Piggy way? For example
> http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/
> shows a sequence of Mahout jobs, beginning with fetching a Reuters
> dataset (a collection of documents), then creating sequence files, and
> then, from those, creating different-flavoured sparse vector
> representations via different arguments/parameters, for subsequent
> consumption in LDA and kmeans clustering jobs. Oh, and then the
> results are printed/explored. Is that the kind of data flow that Pig
> could reasonably be expected to manage via 'MAPREDUCE', or am I
> over-stretching the mechanism?
>
> Another example (clustering again), from
> http://blog.jteam.nl/2011/04/04/how-to-cluster-seinfeld-episodes-with-mahout/
>
> https://github.com/frankscholten/mahout/blob/seinfeld_demo/examples/bin/seinfeld_vectors.sh
> then
> https://github.com/frankscholten/mahout/blob/seinfeld_demo/examples/bin/seinfeld_kmeans.sh
>
> So again, the flow here from those .sh scripts (I'll trim some params,
> leaving just the in/out pipeline) is:
>
>   bin/mahout seqdirectory --input examples/src/main/resources/seinfeld-scripts-preprocessed \
>     --output out-seinfeld-seqfiles [...]
>   bin/mahout seq2sparse --input out-seinfeld-seqfiles \
>     --output out-seinfeld-vectors [...]
>   bin/mahout kmeans --input out-seinfeld-vectors/tfidf-vectors \
>     --output out-seinfeld-kmeans/clusters \
>     --clusters out-seinfeld-kmeans/initialclusters [...]
>   bin/mahout clusterdump --seqFileDir out-seinfeld-kmeans/clusters/clusters-1 \
>     --pointsDir out-seinfeld-kmeans/clusters/clusteredPoints \
>     --numWords 5 \
>     --dictionary out-seinfeld-vectors/dictionary.file-0 \
>     --dictionaryType sequencefile
>
> I should say I'm no expert on the Mahout details either, but since a
> lot of my base input data is being handled (and joined, filtered etc.)
> very nicely by Pig, I'm very curious about having some closer
> integration here. I also have no strong intuition about the impact of
> all this on efficiency, in terms either of parallelism or of the cost
> of storing on disk rather than keeping everything in Pig data
> structures, etc.
>
> >> 2011-09-07 17:08:05,528 [main] ERROR org.apache.pig.tools.grunt.Grunt
> >> - ERROR 1200: <line 4> Failed to parse macro 'collocations'. Reason:
> >> <file mig.macro, line 6, column 1> mismatched input 'LOAD' expecting
> >> STORE
> >>
> >> Complicating things further, I couldn't see a way of creating data for
> >> this dummy input within Pig Latin (or at least the Grunt shell), other
> >> than loading an empty file (which needed creating, cleaning up, etc.).
> >> Is there a syntax for declaring relations as literal data inline that
> >> I'm missing? Also, experimenting in Grunt I found it tricky that
> >> piggybank.jar couldn't be registered within the macro I 'IMPORT', and
> >> that it was all too easy to get an error from importing the same macro
> >> twice within one session.
> >
> > This is definitely something we want to fix.
>
> Thanks. Let me know if you need any more detailed report / filing.
>
> >> The Mahout/Pig proof-of-concept examples are at
> >>
> >> https://raw.github.com/gist/1192831/f9376f0b73533a0a0af4e8d89b6ea3d1871692ff/gistfile1.txt
> >>
> >> Details of the Mahout side of things at
> >>
> >> http://mail-archives.apache.org/mod_mbox/mahout-user/201109.mbox/raw/%3CCAFNgM+YaPgFzEJP+sDNXPKfxi0AmAY7_sN79FN2=1cvjl8c...@mail.gmail.com%3E
> >>
> >> If I'm missing something obvious that will provide for smoother
> >> integration, I'd be very happy to learn. [...]
> >> Is this a reasonable thing to attempt? At least in the Mahout case, it
> >> looks to me common that input might come from other files in HDFS
> >> rather than from Pig relations, so maybe the requirement for STORE ...
> >> INTO could be softened?
> >>
> >> Thanks for any suggestions...
> >
> > That seems to be a very interesting project. Let me know your progress
> > and anything I can help with.
>
> Thanks. I hit a few issues on the Mahout side too, but I'll see how
> far I can get with a simple set of macros, even if I have to use the
> 'IGNORE' hack for now.
> If you have any suggestion for a cleaner syntax/approach that'll work
> in Pig 0.9, I'd love to hear it.
>
> Whether this will ever be truly useful I think depends on the kind of
> pipeline scenarios sketched above, i.e. where more than one consecutive
> step happens outside of Pig. There might be a case for interacting with
> those external programs without having each step of their results
> written into HDFS, but I'm not sure how that would best be implemented.
>
> cheers,
>
> Dan
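For concreteness, a rough, untested sketch of those Seinfeld steps chained with today's MAPREDUCE syntax (the jar name and intermediate paths are only illustrative, each STORE exists purely to satisfy the mandatory clause, and reading Mahout's sequence-file output back into Pig properly would need a suitable load function):

-- placeholder relation, only there because STORE ... INTO is currently mandatory
dummy = LOAD 'empty.txt';

seqfiles = MAPREDUCE 'mahout.jar'
      STORE dummy INTO 'ignored-input-1'
      LOAD 'out-seinfeld-seqfiles'
      `seqdirectory --input examples/src/main/resources/seinfeld-scripts-preprocessed --output out-seinfeld-seqfiles`;

vectors = MAPREDUCE 'mahout.jar'
      STORE seqfiles INTO 'ignored-input-2'
      LOAD 'out-seinfeld-vectors/tfidf-vectors'
      `seq2sparse --input out-seinfeld-seqfiles --output out-seinfeld-vectors`;

-- remaining kmeans parameters trimmed, as in the shell version above
clusters = MAPREDUCE 'mahout.jar'
      STORE vectors INTO 'ignored-input-3'
      LOAD 'out-seinfeld-kmeans/clusters'
      `kmeans --input out-seinfeld-vectors/tfidf-vectors --output out-seinfeld-kmeans/clusters --clusters out-seinfeld-kmeans/initialclusters`;

-- a final STORE/DUMP of clusters (or a clusterdump step) would go here

Making the STORE/LOAD clauses optional would let the dummy relation and the ignored STORE locations disappear, which is exactly the relaxation discussed above.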
