Hi,

We have a requirement to process a set of XML files. Each file contains several records, e.g.:

<RECORD> data of record 1...... </RECORD>
<RECORD> data of record 2...... </RECORD>

The expected output is <filename and individual records>. Since we need the file name in the output as well, we chose wholeTextFiles(). We decided against StreamXmlRecordReader and StreamInputFormat because I could not find a way to retrieve the file name with them.

These XML files can be pretty big; occasionally they reach 1 GB. Since wholeTextFiles() reads each file as a single record, so that the entire content of a file ends up in one partition, would such big files be an issue? The AWS cluster (50 nodes) that we use is fairly strong, with each machine having around 60 GB of memory.

Thanks,
Baahu
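
P.S. For reference, here is a rough PySpark sketch of the approach described above, not our exact job. It assumes the <RECORD> elements carry no attributes and are not nested, so a simple regular expression is enough to split them. The names split_records, records_with_filename, and input_dir are made up for illustration.

```python
import re

# Assumption: records look like <RECORD>...</RECORD>, with no attributes
# on the tag and no nesting, so a non-greedy regex can separate them.
RECORD_RE = re.compile(r"<RECORD>.*?</RECORD>", re.DOTALL)


def split_records(path_and_content):
    """Turn one (path, whole file content) pair into (path, record) pairs."""
    path, content = path_and_content
    return [(path, match.group(0)) for match in RECORD_RE.finditer(content)]


def records_with_filename(sc, input_dir):
    """Build an RDD of (file path, single record) pairs.

    sc is an existing SparkContext; input_dir is a hypothetical directory
    of XML files. wholeTextFiles() yields one (path, content) pair per
    file, so each file is held in memory as a single string.
    """
    return sc.wholeTextFiles(input_dir).flatMap(split_records)
```

The splitting function is plain Python, so it can be tried on a small string without a cluster. The memory concern in the question comes from the content string that wholeTextFiles() hands to split_records, which for our largest files would be about 1 GB.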