Otherwise, please consider using https://github.com/databricks/spark-xml.
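For what it's worth, a rough, untested sketch of reading such files with spark-xml (assuming the records are wrapped in a <RECORD> tag as in the example further below; the path is just a placeholder):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)  // sc is the existing SparkContext

    // Each <RECORD>...</RECORD> element becomes one row of the DataFrame.
    val records = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "RECORD")
      .load("s3://your-bucket/xml-input/*.xml")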
Actually, there is a function to find the input file name: the input_file_name function,
https://github.com/apache/spark/blob/5f342049cce9102fb62b4de2d8d8fa691c2e8ac4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L948

This is available from 1.6.0. Please refer to
https://github.com/apache/spark/pull/13806 and
https://github.com/apache/spark/pull/13759
(a rough usage sketch is appended below the quoted messages).

2016-07-12 22:04 GMT+09:00 Prashant Sharma <scrapco...@gmail.com>:

> Hi Baahu,
>
> That should not be a problem, given you allocate a sufficient buffer for
> reading.
>
> I was just working on implementing a patch[1] to support the feature of
> reading whole text files in SQL. This can actually be a slightly better
> approach, because here we read into off-heap memory for holding the data
> (using the unsafe interface).
>
> 1. https://github.com/apache/spark/pull/14151
>
> Thanks,
>
> --Prashant
>
>
> On Tue, Jul 12, 2016 at 6:24 PM, Bahubali Jain <bahub...@gmail.com> wrote:
>
>> Hi,
>> We have a requirement wherein we need to process a set of XML files; each
>> of the XML files contains several records, e.g.:
>>
>> <RECORD>
>> data of record 1......
>> </RECORD>
>>
>> <RECORD>
>> data of record 2......
>> </RECORD>
>>
>> The expected output is <filename and individual records>.
>>
>> Since we needed the file name as well in the output, we chose
>> wholeTextFiles(). We decided against using StreamXmlRecordReader and
>> StreamInputFormat since I could not find a way to retrieve the filename.
>>
>> These XML files could be pretty big; occasionally they could reach a size
>> of 1GB. Since the contents of each file would be put into a single
>> partition, would such big files be an issue?
>> The AWS cluster (50 nodes) that we use is fairly strong, with each
>> machine having around 60GB of memory.
>>
>> Thanks,
>> Baahu
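Appended sketch (rough and untested) of the input_file_name suggestion above, here combined with spark-xml; whether the column gets populated depends on the data source, and the path and column name are placeholders:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.input_file_name

    val sqlContext = new SQLContext(sc)

    // One row per <RECORD> element, with the originating file name attached.
    val recordsWithFile = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "RECORD")
      .load("s3://your-bucket/xml-input/")
      .withColumn("filename", input_file_name())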