Hi,
We have a requirement to process a set of XML files. Each of the XML files
contains several records, e.g.:
<RECORD>
     data of record 1......
</RECORD>

<RECORD>
    data of record 2......
</RECORD>

The expected output is <filename and individual records>.

Since we need the file name in the output as well, we chose wholeTextFiles().
We had to decide against using StreamXmlRecordReader and StreamInputFormat,
since I could not find a way to retrieve the file name with them.
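
To make the question concrete, here is a minimal sketch (in Python) of the
kind of per-file splitting this implies. It is only an illustration under
assumptions: the helper name split_records, the regular expression, and the
placeholder input path in the comment are hypothetical, and it assumes that
<RECORD> elements are not nested. The Spark call itself is shown only as a
comment and is not executed here.

```python
import re

# Hypothetical pattern: matches one <RECORD>...</RECORD> block, non-greedy,
# across newlines. Assumes RECORD elements are never nested.
RECORD_RE = re.compile(r"<RECORD>.*?</RECORD>", re.DOTALL)

def split_records(path_and_content):
    """Turn one (filename, whole file content) pair into a list of
    (filename, record) pairs, one per <RECORD> block found."""
    path, content = path_and_content
    return [(path, match.group(0)) for match in RECORD_RE.finditer(content)]

# With Spark this would be applied roughly as follows (placeholder path,
# not run in this sketch):
#   sc.wholeTextFiles("/path/to/xml/dir").flatMap(split_records)

# Small local check using an in-memory stand-in for one input file.
sample = (
    "file1.xml",
    "<RECORD>\n  data of record 1\n</RECORD>\n\n"
    "<RECORD>\n  data of record 2\n</RECORD>\n",
)
pairs = split_records(sample)
for name, record in pairs:
    print(name, len(record))
```

Note that in this style each file's full content is held as one string
before it is split, which is exactly why the file size matters.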

These XML files can be quite big; occasionally they reach a size of 1 GB.
Since the contents of each file would be put into a single partition, would
such big files be an issue?
The AWS cluster (50 nodes) that we use is fairly strong, with each machine
having around 60 GB of memory.

Thanks,
Baahu
