On 8 September 2011 20:29, Daniel Dai <[email protected]> wrote:
> Thanks Dan, see my comments inline.
> On Wed, Sep 7, 2011 at 8:08 AM, Dan Brickley <[email protected]> wrote:
>
>> Hi all! I have been experimenting with wrapping some of Apache
>> Mahout's machine learning -related jobs inside Pig macros, via the
>> MAPREDUCE keyword. This seemed quite nearly do-able but I hit a few
>> issues, hence this mail.
>>
>
>> While I enjoyed an initial minor success, I hit a problem because the
>> job I was trying actually wanted to take input from existing data in
>> hdfs, rather than from Pig. However it seems Pig requires a 'STORE FOO
>> INTO' clause when using MAPREDUCE. Is there any reason this is not
>> optional?
>>
>
> We expect the native mapreduce job to take one input produced by Pig and
> produce one output that feeds into the rest of the Pig script. This is the
> interface between Pig and MapReduce.
> Take WordCount as an example:
> b = mapreduce 'hadoop-examples.jar' Store a into 'input' Load 'output'
> `wordcount input output`;
>
> Pig will save a into 'input' and wordcount will take it as its input.
>
> In your script, I saw you hard-code the Mahout input/output. I believe this
> is just a test; in the real world you will use Pig to prepare and consume the
> input/output. Otherwise, what's the point of binding Pig and Mahout?

Yes, I would expect Pig to take on more of the data preparation and
filtering tasks. However, Mahout itself offers several different
components that typically get pipelined together to solve problems. In
my example I was trying to go further by also making a macro for the
Mahout task 'seqdirectory',
https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html ...
I'm not sure whether that can be directly 'piggified', but I was
expecting that Pig could be used to express the data flow: a common
pattern would be for data to start in Pig, pass through one, two or
three Mahout-based tasks, and then for the final output to come back
into Pig's world.
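
To make that concrete, here's the rough shape of the macro I've been
fiddling with for Pig 0.9. This is an untested sketch: the macro name,
jar name, paths and the throwaway STORE relation are all placeholders
of mine, and I'm not at all sure the LOAD side is right for reading
Mahout's sequence files with the default loader.

DEFINE mahout_seqdirectory(dummy, docs_dir, seq_dir) RETURNS converted {
    -- The STORE clause is only here because MAPREDUCE insists on one;
    -- Mahout's SequenceFilesFromDirectory actually reads $docs_dir
    -- directly from HDFS and ignores the stored relation.
    $converted = MAPREDUCE 'mahout-examples-job.jar'
        STORE $dummy INTO '$seq_dir-ignored'
        LOAD '$seq_dir' AS (key:chararray, value:chararray)
        `org.apache.mahout.text.SequenceFilesFromDirectory --input $docs_dir --output $seq_dir`;
};

Invocation would then be something like:

    seqfiles = mahout_seqdirectory(some_relation, 'reuters-docs', 'reuters-seqfiles');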

Maybe it would help to take some of the concrete examples that show up
in typical Mahout howtos, and think through how those might be
expressed in a more Piggy way? For example
http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/
shows a sequence of Mahout jobs, beginning with fetching a Reuters
dataset (collection of documents), and then creating sequence files,
and then from those, creating differently flavoured sparse vector
representations via different arguments/parameters, for subsequent
consumption in LDA and kmeans clustering jobs. Oh, and then the
results are printed/explored. Is that the kind of data flow that Pig
could reasonably be expected to manage via 'MAPREDUCE', or am I
over-stretching the mechanism?

Another example (clustering again) comes from
http://blog.jteam.nl/2011/04/04/how-to-cluster-seinfeld-episodes-with-mahout/
via
https://github.com/frankscholten/mahout/blob/seinfeld_demo/examples/bin/seinfeld_vectors.sh
and then
https://github.com/frankscholten/mahout/blob/seinfeld_demo/examples/bin/seinfeld_kmeans.sh

So the flow from those .sh scripts (I'll trim some params, leaving
just the in/out pipeline) is:

bin/mahout seqdirectory --input  examples/src/main/resources/seinfeld-scripts-preprocessed \
                        --output out-seinfeld-seqfiles [...]
bin/mahout seq2sparse   --input  out-seinfeld-seqfiles \
                        --output out-seinfeld-vectors [...]
bin/mahout kmeans       --input    out-seinfeld-vectors/tfidf-vectors \
                        --output   out-seinfeld-kmeans/clusters \
                        --clusters out-seinfeld-kmeans/initialclusters [...]
bin/mahout clusterdump  --seqFileDir     out-seinfeld-kmeans/clusters/clusters-1 \
                        --pointsDir      out-seinfeld-kmeans/clusters/clusteredPoints \
                        --numWords       5 \
                        --dictionary     out-seinfeld-vectors/dictionary.file-0 \
                        --dictionaryType sequencefile
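
If I've understood the MAPREDUCE operator right, I imagine that
pipeline might be expressed in Pig roughly as below. This is
completely untested: the jar name and AS schemas are guesses, I'm
assuming the Mahout job jar's MahoutDriver main class still resolves
the short program names (seqdirectory, seq2sparse, kmeans) when Pig
launches the jar, and the dummy relation threaded through each STORE
is only there to satisfy the syntax.

docs     = LOAD 'some-pig-prepared-docs' AS (doc:chararray);  -- placeholder Pig-side input

seqfiles = MAPREDUCE 'mahout-examples-job.jar'
    STORE docs INTO 'tmp/ignored-1'
    LOAD 'out-seinfeld-seqfiles' AS (key:chararray, value:chararray)
    `seqdirectory --input examples/src/main/resources/seinfeld-scripts-preprocessed --output out-seinfeld-seqfiles`;

vectors  = MAPREDUCE 'mahout-examples-job.jar'
    STORE seqfiles INTO 'tmp/ignored-2'
    LOAD 'out-seinfeld-vectors/tfidf-vectors' AS (key:chararray, value:chararray)
    `seq2sparse --input out-seinfeld-seqfiles --output out-seinfeld-vectors`;

clusters = MAPREDUCE 'mahout-examples-job.jar'
    STORE vectors INTO 'tmp/ignored-3'
    LOAD 'out-seinfeld-kmeans/clusters' AS (key:chararray, value:chararray)
    `kmeans --input out-seinfeld-vectors/tfidf-vectors --output out-seinfeld-kmeans/clusters --clusters out-seinfeld-kmeans/initialclusters`;

DUMP clusters;  -- clusterdump itself is a local printing step, so probably stays outside Pig

Note that the real data never flows through the relations at all: each
step reads and writes the HDFS paths named in the backtick command,
and the STOREd relation is thrown away, which is exactly why the
mandatory STORE ... INTO feels awkward here.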

I should say I'm no expert on the Mahout details either, but since a
lot of my base input data is already being handled (joined, filtered,
etc.) very nicely by Pig, I'm very curious about closer integration
here. I also have no strong intuition about the impact of all this on
efficiency, either in terms of parallelism or of the cost of storing
intermediate results on disk rather than keeping everything in Pig's
data structures.

>> 2011-09-07 17:08:05,528 [main] ERROR org.apache.pig.tools.grunt.Grunt
>> - ERROR 1200: <line 4> Failed to parse macro 'collocations'. Reason:
>> <file mig.macro, line 6, column 1>  mismatched input 'LOAD' expecting
>> STORE
>>
>> Complicating things further, I couldn't see a way of creating data for
>> this dummy input within Pig Latin (or at least the Grunt shell), other
>> than loading an empty file (which needed creating, cleaning up, etc).
>> Is there a syntax for declaring relations as literal data inline that
>> I'm missing? Also experimenting in Grunt I found it tricky that
>> piggybank.jar couldn't be registered within the macro I 'IMPORT', and
>> that it was all too easy to get an error from importing the same macro
>> twice within one session.
>>
>
> This we definitely want to fix.

Thanks. Let me know if you need a more detailed report or filing.

>> The Mahout/Pig proof of concept examples are at
>>
>> https://raw.github.com/gist/1192831/f9376f0b73533a0a0af4e8d89b6ea3d1871692ff/gistfile1.txt
>>
>> Details of the Mahout side of things at
>>
>> http://mail-archives.apache.org/mod_mbox/mahout-user/201109.mbox/raw/%3CCAFNgM+YaPgFzEJP+sDNXPKfxi0AmAY7_sN79FN2=1cvjl8c...@mail.gmail.com%3E
>>
>> If I'm missing something obvious that will provide for smoother
>> integration, I'd be very happy to learn. [...]
>> Is this a reasonable thing to attempt? At least in the Mahout case, it
>> looks common for input to come from other files in HDFS rather than
>> from Pig relations, so maybe the requirement for STORE ... INTO could
>> be softened?
>>
>
>> Thanks for any suggestions...

> That seems to be a very interesting project. Let me know your progress and
> anything I can help with.

Thanks. I hit a few issues on the Mahout side too, but I'll see how
far I can get with a simple set of macros, even if I have to use the
'IGNORE' hack for now. If you have any suggestions for a cleaner
syntax/approach that will work in Pig 0.9, I'd love to hear them.

Whether this will ever be truly useful depends, I think, on the kind of
pipeline scenarios sketched above, i.e. where more than one consecutive
step happens outside of Pig. There might be a case for interacting with
those external programs without having each step's results written into
HDFS, but I'm not sure how that would best be implemented.

cheers,

Dan
