Re: MapReduce Job on XML input

2007-12-10 Thread Ted Dunning
<[EMAIL PROTECTED]> > To: [email protected] > Sent: Monday, November 26, 2007 8:49:52 AM > Subject: MapReduce Job on XML input > > I would like to run some MapReduce jobs on some XML files I got (approx. > 10 compressed files). > The XML files are not that

Re: MapReduce Job on XML input

2007-12-10 Thread Arun C Murthy
p://wiki.apache.org/lucene-hadoop/HowToContribute thanks, Arun >- Original Message >From: Peter Thygesen <[EMAIL PROTECTED]> >To: [email protected] >Sent: Monday, November 26, 2007 8:49:52 AM >Subject: MapReduce Job on XML input > >I would like to run some

Re: MapReduce Job on XML input

2007-12-10 Thread Alan Ho
I've written an XML input splitter based on a StAX parser. It's much better than StreamXMLRecordReader - Original Message From: Peter Thygesen <[EMAIL PROTECTED]> To: [email protected] Sent: Monday, November 26, 2007 8:49:52 AM Subject: MapReduce Job on XML input I
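Alan's splitter itself isn't posted in the thread; a minimal sketch of the core idea, pulling each record element out of an XML stream with the JDK's StAX API (`javax.xml.stream`), might look like the following. The `record` element name is an assumption for illustration; a Hadoop RecordReader built on this would emit one key/value pair per element instead of one per line.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class StaxRecordSplitter {

    // Pull the text content of every <recordTag> element out of an XML stream.
    // StAX is a pull parser, so this reads forward without building a DOM,
    // which is why it scales to large inputs.
    public static List<String> extractRecords(String xml, String recordTag) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> records = new ArrayList<String>();
        StringBuilder current = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equals(recordTag)) {
                current = new StringBuilder();           // entering a record
            } else if (event == XMLStreamConstants.CHARACTERS && current != null) {
                current.append(reader.getText());        // accumulate its text
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals(recordTag)) {
                records.add(current.toString());         // record complete
                current = null;
            }
        }
        reader.close();
        return records;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<docs><record>alpha</record><record>beta</record></docs>";
        System.out.println(extractRecords(xml, "record"));
    }
}
```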

Re: MapReduce Job on XML input

2007-11-26 Thread Ted Dunning
That isn't all that many files. At 1MB, you shouldn't be seeing much of a performance hit due to reading many files. You will need a special input format, but it can be very simple. Just extend something like TextInputFormat, replace the record reader, and report the file as unsplittable. On 11/2
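Ted's suggestion amounts to an input format whose `isSplitable()` returns false plus a record reader that reads to end-of-file, so each compressed file becomes a single record for one map task. Outside Hadoop, the per-file read such a reader would perform can be sketched as follows; the class and method names here are illustrative, not Hadoop API names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

public class WholeFileReader {

    // Read one gzip-compressed XML file completely, returning its contents as a
    // single String "record" -- the same thing an unsplittable input format's
    // record reader would hand to a single map task.
    public static String readWholeFile(Path file) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(file))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);                 // drain to end-of-file
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            String record = readWholeFile(Paths.get(arg));
            System.out.println(arg + ": " + record.length() + " chars");
        }
    }
}
```

At ~1 MB compressed per file this is cheap; the mapper can then parse the decompressed XML in memory and emit one output pair per contained record.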

MapReduce Job on XML input

2007-11-26 Thread Peter Thygesen
I would like to run some MapReduce jobs on some XML files I got (approx. 10 compressed files). The XML files are not that big, about 1 MB compressed, each containing about 1000 records. Do I have to write my own InputSplitter? Should I use MultiFileInputFormat or StreamInputFormat? Can I use t