On 8 September 2011 20:29, Daniel Dai <[email protected]> wrote:
> Thanks Dan, see my comments inline.
>
> On Wed, Sep 7, 2011 at 8:08 AM, Dan Brickley <[email protected]> wrote:
>
>> Hi all! I have been experimenting with wrapping some of Apache
>> Mahout's machine-learning-related jobs inside Pig macros, via the
>> MAPREDUCE keyword. This seemed quite nearly do-able, but I hit a few
>> issues, hence this mail.
>>
>> While I enjoyed an initial minor success, I hit a problem because the
>> job I was trying actually wanted to take input from existing data in
>> hdfs, rather than from Pig. However, it seems Pig requires a 'STORE FOO
>> INTO' clause when using MAPREDUCE. Is there any reason this is not
>> optional?
>
> We expect the native mapreduce job to take one input produced by Pig, and
> to produce one output that feeds into the rest of the Pig script. This is
> the interface between Pig and MapReduce. Take WordCount as an example:
>
>   b = mapreduce 'hadoop-examples.jar' Store a into 'input' Load 'output'
>       `wordcount input output`;
>
> Pig will save a into 'input' and wordcount will take it as its input.
>
> In your script, I saw you hard-code the Mahout input/output. I believe
> this is just a test; in the real world you would use Pig to prepare and
> consume the input/output. Otherwise, what's the point of binding
> Pig and Mahout?
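Right, that interface makes sense for the single-step case. Just to check I'm
reading it correctly, the fully spelled-out version would be something like
the following (the file names and the output schema are my guesses from the
docs, not something I've run against this exact jar):

  -- Sketch of the MAPREDUCE interface as I understand it; paths and the
  -- output schema are guesses, not a tested script.
  a = LOAD 'some-text.txt' AS (line:chararray);

  -- Pig stores 'a' into 'input', runs the jar with the backquoted args
  -- (much as `hadoop jar` would), then loads 'output' back as relation 'b'.
  b = MAPREDUCE 'hadoop-examples.jar'
        STORE a INTO 'input'
        LOAD 'output' AS (word:chararray, count:int)
        `wordcount input output`;

  DUMP b;

So Pig sits on both ends and the native job is a black box in between, which
works nicely when there is exactly one such step.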
Yes, I would expect Pig could take on more of the data preparation and
filtering tasks. However, Mahout itself offers several different components
that typically get pipelined together to solve problems. In the example, I was
trying to extend things by also making a macro for the Mahout task
'seqdirectory' (https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html).
I'm not sure that task can be directly 'piggified', but I was expecting that
Pig could be used to express the data flow, and that a common pattern would be
for data to start out in Pig, pass through one, two or three Mahout-based
tasks, and then for the final output to come back into Pig's world.

Maybe it would help to take some of the concrete examples that show up in
typical Mahout howtos and think through how those might be expressed in a more
Piggy way? For example,
http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/
shows a sequence of Mahout jobs: fetching a Reuters dataset (a collection of
documents), creating sequence files from it, then creating differently
flavoured sparse vector representations via different arguments/parameters,
for subsequent consumption in LDA and kmeans clustering jobs. Oh, and then the
results are printed/explored. Is that the kind of data flow that Pig could
reasonably be expected to manage via 'MAPREDUCE', or am I over-stretching the
mechanism?

Another example (clustering again), from
http://blog.jteam.nl/2011/04/04/how-to-cluster-seinfeld-episodes-with-mahout/
uses
https://github.com/frankscholten/mahout/blob/seinfeld_demo/examples/bin/seinfeld_vectors.sh
and then
https://github.com/frankscholten/mahout/blob/seinfeld_demo/examples/bin/seinfeld_kmeans.sh

So the flow from those .sh scripts (I'll trim some params, leaving just the
in/out pipeline) is:

bin/mahout seqdirectory --input examples/src/main/resources/seinfeld-scripts-preprocessed \
  --output out-seinfeld-seqfiles [...]

bin/mahout seq2sparse --input out-seinfeld-seqfiles \
  --output out-seinfeld-vectors [...]

bin/mahout kmeans --input out-seinfeld-vectors/tfidf-vectors \
  --output out-seinfeld-kmeans/clusters \
  --clusters out-seinfeld-kmeans/initialclusters [...]

bin/mahout clusterdump --seqFileDir out-seinfeld-kmeans/clusters/clusters-1 \
  --pointsDir out-seinfeld-kmeans/clusters/clusteredPoints \
  --numWords 5 \
  --dictionary out-seinfeld-vectors/dictionary.file-0 \
  --dictionaryType sequencefile

(I've put a rough sketch of how I imagine that flow looking as chained
MAPREDUCE steps at the end of this mail.)

I should say I'm no expert on the Mahout details either, but since a lot of my
base input data is being handled (and joined, filtered, etc.) very nicely by
Pig, I'm very curious about closer integration here. I also have no strong
intuition about the impact of all this on efficiency, in terms either of
parallelism or of the cost of writing each step to disk rather than keeping
everything in Pig data structures.

>> 2011-09-07 17:08:05,528 [main] ERROR org.apache.pig.tools.grunt.Grunt
>> - ERROR 1200: <line 4> Failed to parse macro 'collocations'. Reason:
>> <file mig.macro, line 6, column 1> mismatched input 'LOAD' expecting STORE
>>
>> Complicating things further, I couldn't see a way of creating data for
>> this dummy input within Pig Latin (or at least the Grunt shell), other
>> than loading an empty file (which needed creating, cleaning up, etc.).
>> Is there a syntax for declaring relations as literal data inline that
>> I'm missing?
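(For what it's worth, the shape of the workaround I've ended up with, the
'IGNORE' hack I mention further down, is roughly the following. Treat it as a
sketch only: the macro name, paths, schema and the MahoutDriver invocation are
placeholders rather than a copy of the gist.)

  -- Sketch: satisfy the mandatory STORE clause with a throwaway relation,
  -- while the Mahout job reads its real input from a hard-coded hdfs path.
  DEFINE seqdirectory(dummy) RETURNS seqfiles {
      $seqfiles = MAPREDUCE 'mahout-examples-job.jar'
          STORE $dummy INTO 'ignored-input'
          -- the AS schema is wishful thinking; the real output is a SequenceFile
          LOAD 'out-seqfiles' AS (key:chararray, value:chararray)
          `org.apache.mahout.driver.MahoutDriver seqdirectory --input reuters-out --output out-seqfiles`;
  };

  -- The dummy relation exists only so there is something to STORE;
  -- 'dummy.txt' is an empty file created (and later cleaned up) by hand.
  dummy = LOAD 'dummy.txt' AS (unused:chararray);
  seqs  = seqdirectory(dummy);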
>> Also experimenting in Grunt, I found it tricky that
>> piggybank.jar couldn't be registered within the macro I 'IMPORT', and
>> that it was all too easy to get an error from importing the same macro
>> twice within one session.
>
> This we definitely want to fix.

Thanks. Let me know if you need a more detailed report / filing.

>> The Mahout/Pig proof-of-concept examples are at
>>
>> https://raw.github.com/gist/1192831/f9376f0b73533a0a0af4e8d89b6ea3d1871692ff/gistfile1.txt
>>
>> Details of the Mahout side of things are at
>>
>> http://mail-archives.apache.org/mod_mbox/mahout-user/201109.mbox/raw/%3CCAFNgM+YaPgFzEJP+sDNXPKfxi0AmAY7_sN79FN2=1cvjl8c...@mail.gmail.com%3E
>>
>> If I'm missing something obvious that will provide for smoother
>> integration, I'd be very happy to learn. [...] Is this a reasonable
>> thing to attempt? At least in the Mahout case, it looks to me common
>> that input might come from other files in hdfs rather than from Pig
>> relations, so maybe the requirement for STORE ... INTO could be
>> softened?
>>
>> Thanks for any suggestions...
>
> That seems to be a very interesting project. Let me know your progress and
> anything I can help with.

Thanks. I hit a few issues on the Mahout side too, but I'll see how far I can
get with a simple set of macros, even if I have to use the 'IGNORE' hack for
now. If you have any suggestions for a cleaner syntax/approach that'll work in
Pig 0.9, I'd love to hear them. Whether this will ever be truly useful depends,
I think, on the kind of pipeline scenarios sketched above, i.e. where more than
one consecutive step happens outside of Pig. There might be a case for
interacting with those external programs without having each step's results
written into hdfs, but I'm not sure how that would best be implemented.

cheers,

Dan
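
PS. The rough sketch I promised above, of the Seinfeld flow expressed as
chained MAPREDUCE steps. Everything here is guesswork rather than working
code: the jar name, the MahoutDriver invocations and the intermediate paths
are placeholders, and I've left out clusterdump since that's really a local
inspection step rather than a MapReduce job.

  -- Hypothetical chaining of Mahout steps; placeholders throughout.
  docs = LOAD 'seinfeld-scripts-preprocessed' AS (line:chararray);

  seqfiles = MAPREDUCE 'mahout-examples-job.jar'
      STORE docs INTO 'seinfeld-docs'
      LOAD 'out-seinfeld-seqfiles'
      `org.apache.mahout.driver.MahoutDriver seqdirectory --input seinfeld-docs --output out-seinfeld-seqfiles`;

  vectors = MAPREDUCE 'mahout-examples-job.jar'
      STORE seqfiles INTO 'tmp-seinfeld-seqfiles'
      LOAD 'out-seinfeld-vectors/tfidf-vectors'
      `org.apache.mahout.driver.MahoutDriver seq2sparse --input tmp-seinfeld-seqfiles --output out-seinfeld-vectors`;

  clusters = MAPREDUCE 'mahout-examples-job.jar'
      STORE vectors INTO 'tmp-seinfeld-vectors'
      LOAD 'out-seinfeld-kmeans/clusters'
      `org.apache.mahout.driver.MahoutDriver kmeans --input tmp-seinfeld-vectors --output out-seinfeld-kmeans/clusters --clusters out-seinfeld-kmeans/initialclusters`;

The obvious wrinkle is that what Pig STOREs by default isn't the SequenceFile
format the next Mahout step expects, and the intermediate LOADs would need
loaders that understand Mahout's Writables; that mismatch is really the crux
of what I'm asking about.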
