Thanks Dan, see my comments inline.

Daniel
On Wed, Sep 7, 2011 at 8:08 AM, Dan Brickley <[email protected]> wrote:

> Hi all! I have been experimenting with wrapping some of Apache
> Mahout's machine-learning-related jobs inside Pig macros, via the
> MAPREDUCE keyword. This seemed quite nearly doable, but I hit a few
> issues, hence this mail.
>
> While I enjoyed an initial minor success, I hit a problem because the
> job I was trying actually wanted to take input from existing data in
> HDFS, rather than from Pig. However, it seems Pig requires a 'STORE FOO
> INTO' clause when using MAPREDUCE. Is there any reason this is not
> optional?

We expect the native MapReduce job to take one input produced by Pig and to produce one output that feeds into the rest of the Pig script. This is the interface between Pig and MapReduce. Take WordCount as an example:

    b = MAPREDUCE 'hadoop-examples.jar'
            STORE a INTO 'input'
            LOAD 'output'
            `wordcount input output`;

Pig will save a into 'input', and wordcount will take it as its input. In your script, I saw that you hard-code the Mahout input/output. I believe this is just a test; in the real world you would use Pig to prepare and consume the input/output. Otherwise, what's the point of binding Pig and Mahout together?
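To make the data flow concrete, here is a rough end-to-end sketch of the WordCount case. The paths, the sample schema, and the downstream processing are illustrative, and I'm assuming the stock wordcount example from hadoop-examples.jar, whose output lines are word<TAB>count:

    -- Prepare the input with Pig.
    a = LOAD '/user/daniel/lines.txt' AS (line: chararray);

    -- Pig stores 'a' into 'input', runs the native job, then loads the
    -- job's 'output' directory (word<TAB>count lines) as relation 'b'.
    b = MAPREDUCE 'hadoop-examples.jar'
            STORE a INTO 'input'
            LOAD 'output' AS (word: chararray, count: int)
            `wordcount input output`;

    -- Consume the result downstream in Pig.
    top_words = ORDER b BY count DESC;
    STORE top_words INTO '/user/daniel/top_words';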
> 2011-09-07 17:08:05,528 [main] ERROR org.apache.pig.tools.grunt.Grunt
> - ERROR 1200: <line 4> Failed to parse macro 'collocations'. Reason:
> <file mig.macro, line 6, column 1> mismatched input 'LOAD' expecting
> STORE
>
> Complicating things further, I couldn't see a way of creating data for
> this dummy input within Pig Latin (or at least the Grunt shell), other
> than loading an empty file (which needed creating, cleaning up, etc.).
> Is there a syntax for declaring relations as literal data inline that
> I'm missing? Also, experimenting in Grunt, I found it tricky that
> piggybank.jar couldn't be registered within the macro I 'IMPORT', and
> that it was all too easy to get an error from importing the same macro
> twice within one session.

This is definitely something we want to fix.

> The Mahout/Pig proof-of-concept examples are at
> https://raw.github.com/gist/1192831/f9376f0b73533a0a0af4e8d89b6ea3d1871692ff/gistfile1.txt
>
> Details of the Mahout side of things are at
> http://mail-archives.apache.org/mod_mbox/mahout-user/201109.mbox/raw/%3CCAFNgM+YaPgFzEJP+sDNXPKfxi0AmAY7_sN79FN2=1cvjl8c...@mail.gmail.com%3E
>
> If I'm missing something obvious that will provide for smoother
> integration, I'd be very happy to learn. Currently what I have is just
> this example (the simplest case of reading a seq directory in Mahout
> and doing downstream filtering of the Mahout results in Pig Latin):
>
>     run miglib.pig; -- basic setup, including macro definitions
>
>     -- get collocated phrases from a seqdir
>     reuters_phrases =
>         collocations('/user/danbri/migtest/reuters-out-seqdir', IGNORE);
>
>     political_phrases = FILTER reuters_phrases BY phrase MATCHES
>         '.*(president|minister|government|election).*' AND score > (float)10;
>
> I'd love to get rid of the 'IGNORE' here, but this is the macro expansion:
>
>     DEFINE collocations (SEQDIR, IGNORE) RETURNS sorted_concepts {
>         DEFINE SequenceFileLoader
>             org.apache.pig.piggybank.storage.SequenceFileLoader();
>         raw_concepts = MAPREDUCE
>                 '../../core/target/mahout-core-0.6-SNAPSHOT-job.jar'
>             STORE IGNORE INTO 'migtest/dummy-input'
>             LOAD 'migtest/collocations_output/ngrams/part-r-*'
>                 USING SequenceFileLoader AS (phrase: chararray, score: float)
>             `org.apache.mahout.driver.MahoutDriver
>                 org.apache.mahout.vectorizer.collocations.llr.CollocDriver
>                 -i $SEQDIR -o migtest/collocations_output
>                 --analyzerName org.apache.mahout.vectorizer.DefaultAnalyzer
>                 --maxNGramSize 2 --preprocess --overwrite`;
>         $sorted_concepts = ORDER raw_concepts BY score DESC;
>     };
>
> Is this a reasonable thing to attempt? At least in the Mahout case, it
> looks to me common that input might come from other files in HDFS
> rather than from Pig relations, so maybe the requirement for STORE ...
> INTO could be softened?
>
> Thanks for any suggestions...

That seems to be a very interesting project. Let me know about your progress and anything I can do to help.

> Dan
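PS: Until the dummy-input requirement is relaxed, the empty-file workaround you describe can at least be scripted entirely from Grunt, so the setup and cleanup live in the session rather than being manual steps. A rough sketch (the file name is arbitrary, and the cleanup has to wait until a DUMP or STORE has actually triggered execution, since Pig evaluates lazily):

    fs -touchz migtest/dummy-input-file
    dummy = LOAD 'migtest/dummy-input-file' AS (x: chararray);
    reuters_phrases = collocations('/user/danbri/migtest/reuters-out-seqdir', dummy);
    DUMP reuters_phrases;   -- triggers execution of the pipeline
    fs -rm migtest/dummy-input-file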
