Hi all! I have been experimenting with wrapping some of Apache
Mahout's machine-learning jobs inside Pig macros, via the
MAPREDUCE keyword. This seemed very nearly doable, but I hit a few
issues, hence this mail.

While I enjoyed an initial minor success, I hit a problem because the
job I was trying actually wanted to take its input from existing data in
HDFS, rather than from a Pig relation. However, Pig seems to require a
'STORE FOO INTO' clause when using MAPREDUCE. Is there any reason this
is not optional?

2011-09-07 17:08:05,528 [main] ERROR org.apache.pig.tools.grunt.Grunt
- ERROR 1200: <line 4> Failed to parse macro 'collocations'. Reason:
<file mig.macro, line 6, column 1>  mismatched input 'LOAD' expecting
STORE

Complicating things further, I couldn't see a way of creating data for
this dummy input within Pig Latin (or at least the Grunt shell), other
than loading an empty file (which then has to be created, cleaned up, etc.).
Is there a syntax for declaring relations as literal data inline that
I'm missing? Also, while experimenting in Grunt, I found it awkward that
piggybank.jar couldn't be registered from within the macro file I
IMPORT, and that it was all too easy to get an error from importing the
same macro twice within one session.
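
For concreteness, the empty-file workaround amounts to something like
the sketch below. The path 'migtest/empty-input' and the field name
'dummy' are illustrative placeholders rather than the exact ones in my
setup, and the 'fs' lines assume a Grunt version that passes commands
through to the Hadoop file system shell.

```pig
-- create a zero-length file in HDFS to back the dummy relation
fs -touchz migtest/empty-input

-- dummy relation whose only purpose is to satisfy 'STORE ... INTO'
IGNORE = LOAD 'migtest/empty-input' AS (dummy: chararray);

-- ... call the macro here, passing IGNORE ...

-- clean up afterwards: the empty file, and whatever the STORE wrote
fs -rm migtest/empty-input
fs -rmr migtest/dummy-input
```

None of this does any useful work; it exists only to keep the parser
happy, which is why I'd like to be able to drop it.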

The Mahout/Pig proof of concept examples are at
https://raw.github.com/gist/1192831/f9376f0b73533a0a0af4e8d89b6ea3d1871692ff/gistfile1.txt

Details of the Mahout side of things at
http://mail-archives.apache.org/mod_mbox/mahout-user/201109.mbox/raw/%3CCAFNgM+YaPgFzEJP+sDNXPKfxi0AmAY7_sN79FN2=1cvjl8c...@mail.gmail.com%3E

If I'm missing something obvious that would make for smoother
integration, I'd be very happy to learn. Currently what I have is just
this example (the simplest case: reading a sequence-file directory in
Mahout, then filtering the Mahout results downstream in Pig Latin):


run miglib.pig; -- basic setup, including macro definitions

-- get collocated phrases from a seqdir
reuters_phrases = collocations('/user/danbri/migtest/reuters-out-seqdir', IGNORE);

political_phrases = FILTER reuters_phrases BY
    phrase MATCHES '.*(president|minister|government|election).*' AND score > (float)10;

I'd love to get rid of the 'IGNORE' here, but this is the macro definition:

DEFINE collocations (SEQDIR,IGNORE) RETURNS sorted_concepts {
    DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
    raw_concepts = MAPREDUCE '../../core/target/mahout-core-0.6-SNAPSHOT-job.jar'
        STORE IGNORE INTO 'migtest/dummy-input'
        LOAD 'migtest/collocations_output/ngrams/part-r-*' USING SequenceFileLoader
            AS (phrase: chararray, score: float)
        `org.apache.mahout.driver.MahoutDriver
        org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i $SEQDIR
        -o migtest/collocations_output --analyzerName
        org.apache.mahout.vectorizer.DefaultAnalyzer --maxNGramSize 2
        --preprocess --overwrite `;
    $sorted_concepts = order raw_concepts by score desc;
};


Is this a reasonable thing to attempt? At least in the Mahout case, it
seems common for input to come from other files in HDFS rather than
from Pig relations, so maybe the requirement for STORE ... INTO could
be relaxed?

Thanks for any suggestions...

Dan
