Re: access hdfs file name in map()

Xu (Simon) Chen Wed, 04 Jun 2014 13:01:28 -0700

Never mind. I wrote a HadoopRDD subclass and appended one env field of the
HadoopPartition to the value in the compute function. It worked pretty well.
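
For anyone finding this thread later, here is a minimal sketch of a related
way to get the file name next to each line. It is not the subclass described
above. It assumes a later Spark release in which HadoopRDD exposes the
developer API mapPartitionsWithInputSplit, and it assumes the input is read
with the old-API TextInputFormat so that each split is a FileSplit. The object
name FileNamePerLine and the helper toRow are made up for illustration.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, InputSplit, TextInputFormat}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.HadoopRDD

object FileNamePerLine {
  // Hypothetical helper: builds the (fileName, col0, col2) tuple from the
  // original question. Assumes every line has at least three
  // comma-separated fields.
  def toRow(fileName: String, line: String): (String, String, String) = {
    val p = line.split(",")
    (fileName, p(0), p(2))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("file-name-per-line"))

    // sc.hadoopFile is declared to return RDD[(K, V)]; the cast relies on
    // the underlying object being a HadoopRDD (an assumption of this sketch).
    val hadoopRdd = sc
      .hadoopFile[LongWritable, Text, TextInputFormat]("hdfs://test/path/*")
      .asInstanceOf[HadoopRDD[LongWritable, Text]]

    val rows = hadoopRdd.mapPartitionsWithInputSplit(
      (split: InputSplit, iter: Iterator[(LongWritable, Text)]) => {
        // With TextInputFormat each split should be a FileSplit that
        // carries the path of the file it was cut from.
        val fileName = split.asInstanceOf[FileSplit].getPath.getName
        // Convert Text to String right away, since Hadoop reuses the
        // Text object between records.
        iter.map { case (_, line) => toRow(fileName, line.toString) }
      })

    rows.take(5).foreach(println)
    sc.stop()
  }
}
```

Information encoded in the file name, such as a date, can then be parsed out
of the first tuple element in a following map().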


Thanks!
On Jun 4, 2014 12:22 AM, "Xu (Simon) Chen" <[email protected]> wrote:

> I don't quite get it..
>
> mapPartitionsWithIndex takes a function that maps an integer index and an
> iterator to another iterator. How does that help with retrieving the hdfs
> file name?
>
> I am obviously missing some context..
>
> Thanks.
>  On May 30, 2014 1:28 AM, "Aaron Davidson" <[email protected]> wrote:
>
>> Currently there is not a way to do this using textFile(). However, you
>> could pretty straightforwardly define your own subclass of HadoopRDD [1] in
>> order to get access to this information (likely using
>> mapPartitionsWithIndex to look up the InputSplit for a particular
>> partition).
>>
>> Note that sc.textFile() is just a convenience function to construct a new
>> HadoopRDD [2].
>>
>> [1] HadoopRDD:
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L93
>> [2] sc.textFile():
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456
>>
>>
>> On Thu, May 29, 2014 at 7:49 PM, Xu (Simon) Chen <[email protected]>
>> wrote:
>>
>>> Hello,
>>>
>>> A quick question about using spark to parse text-format CSV files stored
>>> on hdfs.
>>>
>>> I have something very simple:
>>> sc.textFile("hdfs://test/path/*").map(line => line.split(",")).map(p =>
>>> (XXX, p(0), p(2)))
>>>
>>> Here, I want to replace XXX with a string, which is the current csv
>>> filename for the line. This is needed since some information may be encoded
>>> in the file name, like date.
>>>
>>> In hive, I am able to define an external table and use INPUT__FILE__NAME
>>> as a column in queries. I wonder if spark has something similar.
>>>
>>> Thanks!
>>> -Simon
>>>
>>
>>
