> From: Peter Thygesen <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Monday, November 26, 2007 8:49:52 AM
> Subject: MapReduce Job on XML input
>
> I would like to run some MapReduce jobs on some XML files I've got
> (approx. 10 compressed files).
> The XML files are not that big, about 1 MB compressed, each containing
> about 1000 records.

http://wiki.apache.org/lucene-hadoop/HowToContribute
thanks,
Arun
>- Original Message
>From: Peter Thygesen <[EMAIL PROTECTED]>
>To: [email protected]
>Sent: Monday, November 26, 2007 8:49:52 AM
>Subject: MapReduce Job on XML input
>
>I would like to run some MapReduce jobs on some XML files I've got
>(approx. 10 compressed files).
I've written an XML input splitter based on a StAX parser. It's much better
than StreamXmlRecordReader.
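
(Not the poster's actual code: just a minimal sketch of the idea, assuming
Java's javax.xml.stream API and one top-level <record> element per logical
record. The class and tag names are made up for illustration.)

import java.io.InputStream;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Pulls one <record>...</record> element at a time out of an XML stream.
public class StaxRecordPump {
  private final XMLEventReader reader;
  private final XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
  private final String recordTag;

  public StaxRecordPump(InputStream in, String recordTag)
      throws XMLStreamException {
    this.reader = XMLInputFactory.newInstance().createXMLEventReader(in);
    this.recordTag = recordTag;
  }

  // Returns the next record element serialized as a string, or null at EOF.
  public String nextRecord() throws XMLStreamException {
    while (reader.hasNext()) {
      XMLEvent event = reader.nextEvent();
      if (event.isStartElement()
          && recordTag.equals(event.asStartElement().getName().getLocalPart())) {
        StringWriter buf = new StringWriter();
        XMLEventWriter writer = outFactory.createXMLEventWriter(buf);
        writer.add(event);            // copy the opening <record> tag
        int depth = 1;                // track nesting so inner elements survive
        while (depth > 0) {
          XMLEvent e = reader.nextEvent();
          if (e.isStartElement()) depth++;
          if (e.isEndElement()) depth--;
          writer.add(e);              // copy everything through the closing tag
        }
        writer.close();
        return buf.toString();
      }
    }
    return null;                      // no more records in this file
  }
}

A record reader built on this would call nextRecord() from next(key, value)
and set the returned string on the Text value.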
- Original Message
From: Peter Thygesen <[EMAIL PROTECTED]>
To: [email protected]
Sent: Monday, November 26, 2007 8:49:52 AM
Subject: MapReduce Job on XML input
I would like to run some MapReduce jobs on some XML files I've got (approx.
10 compressed files).
That isn't all that many files. At 1 MB each, you shouldn't be seeing much of
a performance hit due to reading many files.
You will need a special input format, but it can be very simple. Just extend
something like TextInputFormat, replace the record reader, and report the
file as unsplittable.
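
As a rough sketch of that shape against the old org.apache.hadoop.mapred API
(the class name is mine, and LineRecordReader is only a compilable stand-in
for the XML-aware record reader you would actually plug in):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class UnsplittableXmlInputFormat
    extends FileInputFormat<LongWritable, Text> {

  // Report every file as unsplittable so each compressed XML file goes
  // whole to a single map task.
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }

  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    // Stand-in reader; swap in one that emits one XML record per call,
    // e.g. something like the StAX pump sketched earlier in the thread.
    return new LineRecordReader(job, (FileSplit) split);
  }
}

With isSplitable() returning false, each of the ~10 files becomes exactly one
split, which is also what you want since the files are compressed.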
On 11/26/07, Peter Thygesen <[EMAIL PROTECTED]> wrote:
I would like to run some MapReduce jobs on some XML files I've got (approx.
10 compressed files).
The XML files are not that big, about 1 MB compressed, each containing
about 1000 records.
Do I have to write my own InputSplitter? Should I use
MultiFileInputFormat or StreamInputFormat? Can I use the
StreamXmlRecordReader?