There's an old post on http://xmlandhadoop.blogspot.com/ that gives some 
direction on using Mahout's XmlInputFormat to read XML from HDFS. As I recall, 
the code in that post contains some errors, but it should be enough to get 
you started with this input format.

If your XML records look like the following snippet:
<element key="value">
<child/>
<child/>
</element>

Then you'll need to set "xmlinput.start" and "xmlinput.end" to "<element" and 
"</element>", respectively, when configuring your job. Note that the start 
value has no closing ">", because the opening tag carries an attribute. That 
little nit cost me a few minutes when I last used this format.
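
To show why the start value matters, here is a minimal sketch in plain Java. 
It is not Mahout's XmlInputFormat, just a simplified stand-in I wrote for 
illustration (the class name XmlRecordScan and the extract method are made up), 
and it assumes the delimiters are matched literally against the input:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of delimiter-based record extraction.
// NOT Mahout's XmlInputFormat; it only mimics the literal-matching idea.
final class XmlRecordScan {

    // Returns every span that begins with startTag and ends with the next endTag.
    static List<String> extract(String input, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = input.indexOf(startTag, from);
            if (start < 0) {
                break; // no further record start
            }
            int end = input.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated record, stop scanning
            }
            end += endTag.length();
            records.add(input.substring(start, end));
            from = end;
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<element key=\"value\">\n<child/>\n<child/>\n</element>\n";
        // "<element" matches the opening tag even though it has an attribute.
        System.out.println(extract(xml, "<element", "</element>").size());
        // "<element>" never appears literally in the input, so nothing is found.
        System.out.println(extract(xml, "<element>", "</element>").size());
    }
}
```

With the sample record above, the first call finds one record and the second 
finds none. In a real job the two values would go into the job configuration 
under the "xmlinput.start" and "xmlinput.end" keys. One caveat of a bare prefix 
like "<element" is that it would also match a tag such as "<elements".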

-Sandeep

On Tuesday, July 12, 2011 at 11:51 AM, Diederik van Liere wrote:

> Hi Mahout list,
> 
> I've got a quick question: is it possible to use Mahout's 
> XmlInputFormat<http://github.com/apache/mahout/blob/ad84344e4055b1e6adff5779339a33fa29e1265d/examples/src/main/java/org/apache/mahout/classifier/bayes/XmlInputFormat.java>
>  in combination with Hadoop Streaming? I would like to replace Hadoop's 
> xmlrecordreader. I have been looking for an example but couldn't find any so 
> maybe this is not possible (but just want to make sure that I am not missing 
> anything).
> Thanks for your help.
> 
> Best,
> Diederik
