Otherwise, please consider using https://github.com/databricks/spark-xml.

Actually, there is a function to find the input file name: the input_file_name function,

https://github.com/apache/spark/blob/5f342049cce9102fb62b4de2d8d8fa691c2e8ac4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L948

It has been available since 1.6.0.

Please refer to https://github.com/apache/spark/pull/13806 and
https://github.com/apache/spark/pull/13759
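
For example, something like the following should give the file name per
record (a rough sketch, untested; the input path, the RECORD row tag, and
reading through spark-xml's DataFrame source are assumptions based on the
use case below, and spark here is a SparkSession; on 1.6.x use
sqlContext.read instead):

  import org.apache.spark.sql.functions.input_file_name

  val records = spark.read
    .format("com.databricks.spark.xml")
    .option("rowTag", "RECORD")                  // one DataFrame row per <RECORD> element
    .load("/path/to/xml/dir")                    // adjust to your input location
    .withColumn("filename", input_file_name())   // source file of each record

  records.show()

That way you get <filename, record> rows without having to fall back to
wholeTextFiles() just to keep the file name.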




2016-07-12 22:04 GMT+09:00 Prashant Sharma <scrapco...@gmail.com>:

> Hi Baahu,
>
> That should not be a problem, given that you allocate a sufficient buffer
> for reading.
>
> I was just working on implementing a patch [1] to support reading whole
> text files in SQL. This can actually be a slightly better approach, because
> here we read the data into off-heap memory (using the unsafe interface).
>
> 1. https://github.com/apache/spark/pull/14151
>
> Thanks,
>
>
>
> --Prashant
>
>
> On Tue, Jul 12, 2016 at 6:24 PM, Bahubali Jain <bahub...@gmail.com> wrote:
>
>> Hi,
>> We have a requirement wherein we need to process a set of XML files; each
>> of the XML files contains several records, e.g.:
>> <RECORD>
>>      data of record 1......
>> </RECORD>
>>
>> <RECORD>
>>     data of record 2......
>> </RECORD>
>>
>> The expected output is <filename, individual record> pairs.
>>
>> Since we needed the file name as well in the output, we chose wholeTextFiles().
>> We decided against using StreamXmlRecordReader and StreamInputFormat
>> since I could not find a way to retrieve the filename.
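>>
>> Roughly, we are doing something like this (a sketch; the input path and the
>> splitRecords parsing helper are placeholders for our own code):
>>
>>   val perFile = sc.wholeTextFiles("/path/to/xml/dir")   // RDD[(fileName, fileContent)]
>>   val records = perFile.flatMap { case (fileName, content) =>
>>     // split the file content into individual <RECORD>...</RECORD> strings
>>     splitRecords(content).map(record => (fileName, record))
>>   }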
>>
>> These XML files could be pretty big; occasionally they reach a size of
>> 1 GB. Since the contents of each file would be put into a single partition,
>> would such big files be an issue?
>> The AWS cluster (50 nodes) that we use is fairly strong, with each
>> machine having around 60 GB of memory.
>>
>> Thanks,
>> Baahu
>>
>
>
