It's expecting a constructor with a certain signature, and XmlInputFormat has no such constructor. I suspect you are not meant to use this input format with Streaming in this way, though I don't know the exact nature of what you need to do. See the comment in StreamInputFormat's javadoc:
/**
 * An input format that selects a RecordReader based on a JobConf property.
 * This should be used only for non-standard record reader such as
 * StreamXmlRecordReader. For all other standard record readers, the
 * appropriate input format classes should be used.
 */

On Thu, Jul 14, 2011 at 8:48 PM, Diederik van Liere <[email protected]> wrote:
>
> Hi,
> Sandeep thanks so much for your reply. Yes I am aware of that blogpost, but
> it does not explain how to use Mahout with Hadoop's streaming interface.
> I issued the following command:
>
> hadoop jar \
>     /usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u0.jar \
>     -Dmapred.reduce.tasks=0 \
>     -Dmapred.child.java.opts=-Xmx1024m \
>     -Dmapred.child.ulimit=3145728 \
>     -libjars /usr/local/mahout/examples/target/mahout-examples-0.6-SNAPSHOT.jar \
>     -input /usr/hadoop/enwiki-20110405-pages-meta-history5.xml \
>     -output /usr/hadoop/out \
>     -mapper ~/wikihadoop/xml_streamer_simulated.py \
>     -inputreader "org.apache.mahout.classifier.bayes.XmlInputFormat,begin=<page,end=</page>"
>
> And I got the following output:
>
> 2011-07-14 17:45:58,894 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
> 2011-07-14 17:45:59,231 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink:
>     /data/hadoop/datastore/hadoop-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107131716_0002/jars/.job.jar.crc <-
>     /data/hadoop/datastore/hadoop-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107131716_0002/attempt_201107131716_0002_m_000000_0/work/.job.jar.crc
> 2011-07-14 17:45:59,242 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink:
>     /data/hadoop/datastore/hadoop-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107131716_0002/jars/job.jar <-
>     /data/hadoop/datastore/hadoop-hadoop/mapred/local/taskTracker/hadoop/jobcache/job_201107131716_0002/attempt_201107131716_0002_m_000000_0/work/job.jar
> 2011-07-14 17:45:59,328 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
> 2011-07-14 17:45:59,552 INFO org.apache.hadoop.mapred.FileInputFormat: getRecordReader
>     start.....split=hdfs://beta:54310/usr/hadoop/enwiki-20110405-pages-meta-history5.xml:0+67108864
> 2011-07-14 17:45:59,778 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
> 2011-07-14 17:45:59,802 WARN org.apache.hadoop.mapred.Child: Error running child
> java.lang.RuntimeException: java.lang.NoSuchMethodException:
>     org.apache.mahout.classifier.bayes.XmlInputFormat.<init>(org.apache.hadoop.fs.FSDataInputStream,
>     org.apache.hadoop.mapred.FileSplit, org.apache.hadoop.mapred.Reporter,
>     org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.FileSystem)
>     at org.apache.hadoop.streaming.StreamInputFormat.getRecordReader(StreamInputFormat.java:69)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:370)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>     at org.apache.hadoop.mapred.Child.main(Child.java:262)
> Caused by: java.lang.NoSuchMethodException:
>     org.apache.mahout.classifier.bayes.XmlInputFormat.<init>(org.apache.hadoop.fs.FSDataInputStream,
>     org.apache.hadoop.mapred.FileSplit, org.apache.hadoop.mapred.Reporter,
>     org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.FileSystem)
>     at java.lang.Class.getConstructor0(Class.java:2706)
>     at java.lang.Class.getConstructor(Class.java:1657)
>     at org.apache.hadoop.streaming.StreamInputFormat.getRecordReader(StreamInputFormat.java:66)
>     ... 7 more
> 2011-07-14 17:45:59,806 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task
>
> How can I fix this problem? Or does this mean that you cannot use Mahout
> with Hadoop Streaming?
> Any other suggestions are welcome too!
>
> Best,
> Diederik
>
>
> -----Original Message-----
> From: Sandeep Parikh [mailto:[email protected]] On Behalf Of Sandeep Parikh
> Sent: July-12-11 5:42 PM
> To: [email protected]
> Subject: Re: Using Mahout XmlInputFormat with Hadoop Streaming
>
> There's an old post on http://xmlandhadoop.blogspot.com/ that provides some
> direction on using Mahout's XmlInputFormat to read XML from HDFS. As I
> recall, the code itself contains some errors but in general it should be
> sufficient to get you started with this input format.
>
> If your XML records look like the following snippet:
>
> <element key="value">
>   <child/>
>   <child/>
> </element>
>
> then you'll need to set "xmlinput.start" and "xmlinput.end" to "<element" and
> "</element>", respectively, when configuring your job. That little nit cost
> me a few minutes when I last used this format.
>
> -Sandeep
>
> On Tuesday, July 12, 2011 at 11:51 AM, Diederik van Liere wrote:
>
> > Hi Mahout list,
> >
> > I've got a quick question: is it possible to use Mahout's
> > XmlInputFormat <http://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java>
> > in combination with Hadoop Streaming? I would like to replace Hadoop's
> > XML record reader. I have been looking for an example but couldn't find
> > any, so maybe this is not possible (but I just want to make sure that I
> > am not missing anything).
> > Thanks for your help.
> >
> > Best,
> > Diederik
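For what it's worth, the begin/end record-splitting the thread is after (begin=<page, end=</page>) is simple enough to sketch in plain Python as part of the streaming mapper itself, sidestepping the custom record reader entirely. This is a hypothetical illustration, not Mahout's actual XmlInputFormat logic, and the function name extract_records is my own; it assumes each record fits in memory and that the markers never nest.

```python
import sys

def extract_records(text, begin="<page", end="</page>"):
    """Yield every substring running from a `begin` marker through the
    next `end` marker, inclusive -- roughly what an XML-splitting record
    reader would hand to a streaming mapper one record at a time."""
    pos = 0
    while True:
        start = text.find(begin, pos)
        if start == -1:
            return  # no more records
        stop = text.find(end, start)
        if stop == -1:
            return  # truncated final record; drop it
        stop += len(end)
        yield text[start:stop]
        pos = stop

if __name__ == "__main__":
    # Emit one tab-separated line per record so a downstream streaming
    # reducer sees conventional key<TAB>value pairs.
    for record in extract_records(sys.stdin.read()):
        print(record.replace("\n", " "), "1", sep="\t")
```

Reading all of stdin at once obviously does not scale to a full Wikipedia history dump, where the splitting belongs in a real record reader; this sketch only shows the delimiter logic under discussion.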
