Never mind. I wrote a HadoopRDD subclass that appends one env field of the HadoopPartition to the value in the compute function. It worked pretty well.
Thanks!

On Jun 4, 2014 12:22 AM, "Xu (Simon) Chen" <[email protected]> wrote:

> I don't quite get it.
>
> mapPartitionsWithIndex takes a function that maps an integer index and an
> iterator to another iterator. How does that help with retrieving the HDFS
> file name?
>
> I am obviously missing some context.
>
> Thanks.
>
> On May 30, 2014 1:28 AM, "Aaron Davidson" <[email protected]> wrote:
>
>> Currently there is not a way to do this using textFile(). However, you
>> could pretty straightforwardly define your own subclass of HadoopRDD [1] in
>> order to get access to this information (likely using
>> mapPartitionsWithIndex to look up the InputSplit for a particular
>> partition).
>>
>> Note that sc.textFile() is just a convenience function to construct a new
>> HadoopRDD [2].
>>
>> [1] HadoopRDD:
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L93
>> [2] sc.textFile():
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456
>>
>> On Thu, May 29, 2014 at 7:49 PM, Xu (Simon) Chen <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> A quick question about using Spark to parse text-format CSV files stored
>>> on HDFS.
>>>
>>> I have something very simple:
>>> sc.textFile("hdfs://test/path/*").map(line => line.split(",")).map(p => (XXX, p(0), p(2)))
>>>
>>> Here, I want to replace XXX with a string, which is the current CSV
>>> filename for the line. This is needed since some information may be encoded
>>> in the file name, like the date.
>>>
>>> In Hive, I am able to define an external table and use INPUT__FILE__NAME
>>> as a column in queries. I wonder if Spark has something similar.
>>>
>>> Thanks!
>>> -Simon
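
For anyone finding this thread later, here is a minimal sketch of one way to attach the file name to each line. It is not the HadoopRDD subclass described at the top of this message. Instead it assumes Spark 1.1 or later, where HadoopRDD exposes the developer-API method mapPartitionsWithInputSplit, and it assumes the RDD returned by sc.hadoopFile() is a HadoopRDD so the cast succeeds. The object name FileNamePerLine and the helper toRow are made up for illustration; the path is the one from the original question.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, InputSplit, TextInputFormat}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.HadoopRDD

// Hypothetical object name, used only for this illustration.
object FileNamePerLine {

  // Hypothetical helper: builds the (fileName, col0, col2) tuple from the
  // original question. Throws if the line has fewer than three fields.
  def toRow(fileName: String, line: String): (String, String, String) = {
    val p = line.split(",")
    (fileName, p(0), p(2))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("file-name-per-line"))

    // sc.textFile() is a convenience wrapper; calling hadoopFile() directly
    // keeps the (offset, line) pairs and the underlying HadoopRDD accessible.
    val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs://test/path/*")

    val rows = raw
      .asInstanceOf[HadoopRDD[LongWritable, Text]] // assumption: hadoopFile() returns a HadoopRDD
      .mapPartitionsWithInputSplit { (split: InputSplit, iter: Iterator[(LongWritable, Text)]) =>
        // With TextInputFormat each split is a FileSplit, which knows its file.
        val fileName = split.asInstanceOf[FileSplit].getPath.getName
        // Convert Text to String right away, since Hadoop reuses the Text object.
        iter.map { case (_, line) => toRow(fileName, line.toString) }
      }

    rows.take(5).foreach(println)
    sc.stop()
  }
}
```

The file name is looked up once per partition rather than once per line, which is the same idea as reading it from the partition inside compute().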
