I put the 0.9 distribution onto my cluster and updated my MAHOUT_HOME. The discrepancy from the tutorial (wikipediaXMLSplitter vs. org.apache.mahout.text.wikipedia.WikipediaXmlSplitter) still stands.
Then I got a new issue: NoSuchMethodError: org.apache.hadoop.util.ProgramDriver.driver([Ljava/lang/String;)

From peeking around, I tried running it as:

    .../hadoop jar $MAHOUT_HOME/mahout-examples-0.9-job.jar org.apache.mahout.text.wikipedia.WikipediaXmlSplitter -d $MAHOUT_HOME/examples/tmp/enwiki-latest-pages-articles.xml -o wikipedia/chunks -c 64

and I am back to the heap space error. Full details here: http://pastebin.com/rCNyTypf

Any advice is greatly appreciated.

Jessie

---------- Forwarded message ----------
From: Suneel Marthi <[email protected]>
To: "[email protected]" <[email protected]>
Date: Sat, 1 Mar 2014 09:15:22 -0800 (PST)
Subject: Re: wikipedia bayes quickstart example on EC2 (cloudera)

Please work off of the latest Mahout 0.9; most of these issues from Mahout 0.7 have been addressed in later releases.

On Saturday, March 1, 2014 12:14 PM, Jessie Wright <[email protected]> wrote:

Hi,

I'm a noob trying to run the wikipedia bayes example on EC2 (using a cdh4.5 setup). I've searched the archives and haven't been able to find info on this; I apologize if this is a duplicate question.

The cloudera install comes with Mahout 0.7. I've run into a few snags on the first step (chunking the data into pieces). The first was that it couldn't find wikipediaXMLSplitter, but I found that substituting org.apache.mahout.text.wikipedia.WikipediaXmlSplitter in the command got past that error (just changing the capitalization wasn't enough).

However, I am now stuck. I'm getting a java.lang.OutOfMemoryError: Java heap space error. I upped MAHOUT_HEAPSIZE to 5000 and am still getting the same error. See the full error here: http://pastebin.com/P5PYuR8U (I added a print statement to bin/mahout just to confirm that my export of MAHOUT_HEAPSIZE was being successfully detected.)

I'm wondering whether some other setting is overriding MAHOUT_HEAPSIZE, perhaps one of the hadoop- or cloudera-specific ones? Does anyone have any experience with this or suggestions?

Thank you,
Jessie Wright
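A note on the heap error at the top of this thread: MAHOUT_HEAPSIZE is only read by the bin/mahout launcher script. When WikipediaXmlSplitter is invoked directly through hadoop jar, as above, the client JVM's heap comes from Hadoop's own environment instead. A sketch of a workaround, assuming a stock bin/hadoop that honors HADOOP_CLIENT_OPTS (the 4g value here is an arbitrary example, not a recommendation):

    # Raise the client-side heap for the hadoop jar invocation itself;
    # bin/hadoop appends HADOOP_CLIENT_OPTS to the java command line.
    export HADOOP_CLIENT_OPTS="-Xmx4g $HADOOP_CLIENT_OPTS"
    hadoop jar $MAHOUT_HOME/mahout-examples-0.9-job.jar \
        org.apache.mahout.text.wikipedia.WikipediaXmlSplitter \
        -d $MAHOUT_HOME/examples/tmp/enwiki-latest-pages-articles.xml \
        -o wikipedia/chunks -c 64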

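On the original MAHOUT_HEAPSIZE question: in a cluster setup, bin/mahout typically hands the job off to bin/hadoop, at which point Hadoop's heap settings (HADOOP_HEAPSIZE, HADOOP_CLIENT_OPTS) take precedence over MAHOUT_HEAPSIZE, which would explain why raising it to 5000 had no effect. One way to confirm which flags the running JVM actually received, assuming standard JDK and procps tools on the node:

    # Show running java processes with their full command lines (including -Xmx)
    ps -ef | grep [W]ikipediaXmlSplitter
    # Or use the JDK's jps to list main classes together with their JVM arguments
    jps -lv | grep -i mahout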