Re: access hdfs file name in map()

Xu (Simon) Chen Tue, 03 Jun 2014 21:23:23 -0700

I don't quite get it..

mapPartitionsWithIndex takes a function that maps an integer index and an
iterator to another iterator. How does that help with retrieving the HDFS
file name?


I am obviously missing some context..

Thanks.
On May 30, 2014 1:28 AM, "Aaron Davidson" <[email protected]> wrote:

> Currently there is not a way to do this using textFile(). However, you
> could pretty straightforwardly define your own subclass of HadoopRDD [1] in
> order to get access to this information (likely using
> mapPartitionsWithIndex to look up the InputSplit for a particular
> partition).
>
> Note that sc.textFile() is just a convenience function to construct a new
> HadoopRDD [2].
>
> [1] HadoopRDD:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L93
> [2] sc.textFile():
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456
>
>
> On Thu, May 29, 2014 at 7:49 PM, Xu (Simon) Chen <[email protected]>
> wrote:
>
>> Hello,
>>
>> A quick question about using spark to parse text-format CSV files stored
>> on hdfs.
>>
>> I have something very simple:
>> sc.textFile("hdfs://test/path/*").map(line => line.split(",")).map(p =>
>> (XXX, p(0), p(2)))
>>
>> Here, I want to replace XXX with a string, which is the current csv
>> filename for the line. This is needed since some information may be encoded
>> in the file name, like date.
>>
>> In hive, I am able to define an external table and use INPUT__FILE__NAME
>> as a column in queries. I wonder if spark has something similar.
>>
>> Thanks!
>> -Simon
>>
>
>
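
For readers following the thread: the link between a partition and a file
name is the InputSplit. Each partition of a HadoopRDD is backed by one split,
and for file-based input formats that split is a FileSplit, which carries the
path. The sketch below is one possible way to get at it, not necessarily what
Aaron had in mind. It assumes a Spark version whose HadoopRDD exposes the
developer API mapPartitionsWithInputSplit (it may not be available in the
release discussed above), and that the input format yields FileSplit
instances, as TextInputFormat does. The object name FileNamePerLine is made up
for illustration; the path is the one from the original question.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, InputSplit, TextInputFormat}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.HadoopRDD

// Hypothetical driver object; the name is illustrative only.
object FileNamePerLine {
  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("file-name-per-line"))

    // sc.textFile() is a thin wrapper around hadoopFile(); calling hadoopFile()
    // directly keeps the (byte offset, line) pairs of the underlying RDD.
    val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs://test/path/*")

    // Assumption: hadoopFile() returns a HadoopRDD under the hood, so the
    // cast succeeds. It would fail for any other RDD implementation.
    val hadoopRdd = raw.asInstanceOf[HadoopRDD[LongWritable, Text]]

    val rows = hadoopRdd.mapPartitionsWithInputSplit {
      (split: InputSplit, lines: Iterator[(LongWritable, Text)]) =>
        // Assumption: the split is a FileSplit (true for TextInputFormat).
        val fileName = split.asInstanceOf[FileSplit].getPath.getName
        lines.map { case (_, line) =>
          // Convert the Text to a String right away, since Hadoop reuses it.
          val p = line.toString.split(",")
          // Throws if a line has fewer than three comma-separated fields.
          (fileName, p(0), p(2))
        }
    }

    rows.take(5).foreach(println)
    sc.stop()
  }
}
```

The file name is resolved once per partition rather than once per line, which
is why the split is read outside the inner map. Use getPath.toString instead
of getPath.getName if the directory part of the path also encodes information.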
