I don't quite get it. mapPartitionsWithIndex takes a function that maps an integer index and an iterator to another iterator. How does that help with retrieving the HDFS file name?
I am obviously missing some context. Thanks.

On May 30, 2014 1:28 AM, "Aaron Davidson" <[email protected]> wrote:

> Currently there is not a way to do this using textFile(). However, you
> could pretty straightforwardly define your own subclass of HadoopRDD [1] in
> order to get access to this information (likely using
> mapPartitionsWithIndex to look up the InputSplit for a particular
> partition).
>
> Note that sc.textFile() is just a convenience function to construct a new
> HadoopRDD [2].
>
> [1] HadoopRDD:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L93
> [2] sc.textFile():
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456
>
>
> On Thu, May 29, 2014 at 7:49 PM, Xu (Simon) Chen <[email protected]>
> wrote:
>
>> Hello,
>>
>> A quick question about using spark to parse text-format CSV files stored
>> on hdfs.
>>
>> I have something very simple:
>> sc.textFile("hdfs://test/path/*").map(line => line.split(",")).map(p =>
>> (XXX, p[0], p[2]))
>>
>> Here, I want to replace XXX with a string, which is the current csv
>> filename for the line. This is needed since some information may be encoded
>> in the file name, like date.
>>
>> In hive, I am able to define an external table and use INPUT__FILE__NAME
>> as a column in queries. I wonder if spark has something similar.
>>
>> Thanks!
>> -Simon
>>
>
>
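
On the mapPartitionsWithIndex question, here is a minimal sketch of how I read the suggestion, written in plain Scala with no Spark dependency. Each partition of a HadoopRDD is backed by one InputSplit, so the integer index handed to the function is enough to look up the split, and from it the file, that the partition's lines came from. Everything named here is a hypothetical stand-in rather than Spark API: `Split` plays the role of a Hadoop file split, the local `mapPartitionsWithIndex` only mimics the shape of the RDD method, and the two CSV file names are made up.

```scala
// Plain-Scala model of the idea; no Spark dependency.
// Split, splits and the local mapPartitionsWithIndex are hypothetical stand-ins.
object FileNameSketch {
  // Stand-in for a Hadoop file split: one chunk of one input file.
  final case class Split(path: String, lines: List[String])

  // Stand-in for RDD.mapPartitionsWithIndex: partition i holds the records of split i.
  def mapPartitionsWithIndex[A, B](partitions: Vector[List[A]])(
      f: (Int, Iterator[A]) => Iterator[B]): Vector[List[B]] =
    partitions.zipWithIndex.map { case (part, idx) => f(idx, part.iterator).toList }

  // Made-up input: two files under the path from the original question.
  val splits: Vector[Split] = Vector(
    Split("hdfs://test/path/2014-05-29.csv", List("a,b,c", "d,e,f")),
    Split("hdfs://test/path/2014-05-30.csv", List("g,h,i"))
  )

  // The index is the only extra information the function receives, but it is
  // enough to look up the split (and therefore the file) behind the partition.
  val tagged: List[(String, String, String)] =
    mapPartitionsWithIndex(splits.map(_.lines)) { (idx, lines) =>
      val fileName = splits(idx).path
      // Scala indexes arrays as p(0), not p[0].
      lines.map(_.split(",")).map(p => (fileName, p(0), p(2)))
    }.flatten.toList

  def main(args: Array[String]): Unit = tagged.foreach(println)
}
```

In a real job the lookup `splits(idx)` would presumably go through the partitions of the HadoopRDD subclass instead of a local Vector, which is why the subclass is needed: the split information is not visible from the RDD that textFile() returns.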
