Hi all!

I have been experimenting with wrapping some of Apache Mahout's machine-learning jobs inside Pig macros, via the MAPREDUCE keyword. This seemed very nearly doable, but I hit a few issues, hence this mail.
While I enjoyed an initial minor success, I hit a problem because the job I was trying actually wanted to take input from existing data in HDFS, rather than from Pig. However, it seems Pig requires a 'STORE FOO INTO' clause when using MAPREDUCE. Is there any reason this is not optional?

    2011-09-07 17:08:05,528 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 4> Failed to parse macro 'collocations'. Reason: <file mig.macro, line 6, column 1> mismatched input 'LOAD' expecting STORE

Complicating things further, I couldn't see a way of creating data for this dummy input within Pig Latin (or at least the Grunt shell), other than loading an empty file (which needed creating, cleaning up, etc). Is there a syntax for declaring relations as literal data inline that I'm missing?

Also, experimenting in Grunt, I found it tricky that piggybank.jar couldn't be registered within the macro I 'IMPORT', and that it was all too easy to get an error from importing the same macro twice within one session.

The Mahout/Pig proof of concept examples are at
https://raw.github.com/gist/1192831/f9376f0b73533a0a0af4e8d89b6ea3d1871692ff/gistfile1.txt

Details of the Mahout side of things at
http://mail-archives.apache.org/mod_mbox/mahout-user/201109.mbox/raw/%3CCAFNgM+YaPgFzEJP+sDNXPKfxi0AmAY7_sN79FN2=1cvjl8c...@mail.gmail.com%3E

If I'm missing something obvious that will provide for smoother integration, I'd be very happy to learn.
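For concreteness, the empty-file workaround mentioned above looks roughly like the following in Grunt. This is only a minimal sketch: the path migtest/empty-input and the alias name dummy are made up for illustration, and it assumes Grunt's fs command passes -touchz, -rm and -rmr through to the Hadoop filesystem shell.

```
-- create a zero-length file in HDFS to act as the dummy input (illustrative path)
fs -touchz migtest/empty-input

-- load it as a throwaway relation, only there to satisfy the STORE clause
dummy = LOAD 'migtest/empty-input' AS (ignored: chararray);

-- ... run the MAPREDUCE macro here, passing dummy as the relation to STORE ...

-- clean up afterwards: the empty file and whatever the STORE clause wrote
fs -rm migtest/empty-input
fs -rmr migtest/dummy-input
```

The clean-up matters because a second run would otherwise trip over the existing STORE output directory.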
Currently what I have is just this example (the simplest case: reading a seq directory in Mahout and doing downstream filtering of the Mahout results in Pig Latin):

    run miglib.pig; -- basic setup, including macro definitions

    -- get collocated phrases from a seqdir
    reuters_phrases = collocations('/user/danbri/migtest/reuters-out-seqdir', IGNORE);
    political_phrases = FILTER reuters_phrases BY phrase MATCHES '.*(president|minister|government|election).*' AND score > (float)10;

I'd love to get rid of the 'IGNORE' here, but this is the macro expansion:

    DEFINE collocations (SEQDIR,IGNORE) RETURNS sorted_concepts {
      DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
      raw_concepts = MAPREDUCE '../../core/target/mahout-core-0.6-SNAPSHOT-job.jar'
        STORE IGNORE INTO 'migtest/dummy-input'
        LOAD 'migtest/collocations_output/ngrams/part-r-*' USING SequenceFileLoader AS (phrase: chararray, score: float)
        `org.apache.mahout.driver.MahoutDriver org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i $SEQDIR -o migtest/collocations_output --analyzerName org.apache.mahout.vectorizer.DefaultAnalyzer --maxNGramSize 2 --preprocess --overwrite `;
      $sorted_concepts = order raw_concepts by score desc;
    };

Is this a reasonable thing to attempt? At least in the Mahout case, it looks common to me that input comes from other files in HDFS rather than from Pig relations, so maybe the requirement for STORE ... INTO could be softened?

Thanks for any suggestions...

Dan
