Currently there is no way to do this using textFile(). However, you could fairly easily define your own subclass of HadoopRDD [1] to get access to this information (likely using mapPartitionsWithIndex to look up the InputSplit for a particular partition).
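As a rough sketch of the idea, here is one way to tag each line with its file name without writing a subclass. It assumes a Spark version in which HadoopRDD exposes the developer-API method mapPartitionsWithInputSplit; if your version does not have it, the subclass approach described above is the fallback. The names FileNameTagging, tagLine and linesWithFileName are made up for illustration.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, InputSplit, TextInputFormat}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.{HadoopRDD, RDD}

// Hypothetical helper object; not part of Spark.
object FileNameTagging {

  // Split one CSV line and keep columns 0 and 2, tagged with the file name.
  // Assumes every line has at least three non-empty-trailing columns;
  // shorter lines will throw ArrayIndexOutOfBoundsException.
  def tagLine(fileName: String, line: String): (String, String, String) = {
    val p = line.split(",")
    (fileName, p(0), p(2))
  }

  def linesWithFileName(sc: SparkContext, path: String): RDD[(String, String, String)] = {
    // sc.hadoopFile with the old-API TextInputFormat is backed by a HadoopRDD;
    // the cast relies on that and exposes the HadoopRDD-specific method below.
    val hadoopRdd = sc
      .hadoopFile[LongWritable, Text, TextInputFormat](path)
      .asInstanceOf[HadoopRDD[LongWritable, Text]]

    hadoopRdd.mapPartitionsWithInputSplit {
      (split: InputSplit, iter: Iterator[(LongWritable, Text)]) =>
        // For file-based input formats each partition's split is a FileSplit.
        val fileName = split.asInstanceOf[FileSplit].getPath.getName
        // Convert Text to String right away, since Hadoop reuses the Text object.
        iter.map { case (_, text) => tagLine(fileName, text.toString) }
    }
  }
}
```

With this in place, FileNameTagging.linesWithFileName(sc, "hdfs://test/path/*") should give you (fileName, column 0, column 2) tuples. Use getPath.toString instead of getPath.getName if you need the full path rather than just the last component.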
Note that sc.textFile() is just a convenience function to construct a new HadoopRDD [2].

[1] HadoopRDD: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L93
[2] sc.textFile(): https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456

On Thu, May 29, 2014 at 7:49 PM, Xu (Simon) Chen <[email protected]> wrote:

> Hello,
>
> A quick question about using spark to parse text-format CSV files stored
> on hdfs.
>
> I have something very simple:
> sc.textFile("hdfs://test/path/*").map(line => line.split(",")).map(p =>
> (XXX, p[0], p[2]))
>
> Here, I want to replace XXX with a string, which is the current csv
> filename for the line. This is needed since some information may be encoded
> in the file name, like date.
>
> In hive, I am able to define an external table and use INPUT__FILE__NAME
> as a column in queries. I wonder if spark has something similar.
>
> Thanks!
> -Simon
