Hi Roberto,

Ultimately, the info you need is set here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L69
Being a Spark newbie, I extended the org.apache.spark.rdd.HadoopRDD class as HadoopRDDWithEnv, which takes an additional parameter (varName) in its constructor, and overrode the compute() function to return something like

    split.getPipeEnvVars.getOrElse(varName, "") + "|" + value.toString()

as the value. This is obviously less general and makes certain assumptions about the input data. You also need to write several wrappers in SparkContext so that you can do something like sc.textFileWithEnv("hdfs path", "mapreduce_map_input_file"). (There's a rough sketch of both pieces below, after the quoted message.)

What I was really hoping for was something like

    sc.textFile("hdfs_path").pipe("""/usr/bin/awk "{print\"${mapreduce_map_input_file}\",$0}" """)

but that gives me a weird Kryo buffer overflow exception... I haven't had a chance to look into the details yet. (A guess at a fix for that is also sketched below.)

-Simon

On Fri, Aug 1, 2014 at 7:38 AM, Roberto Torella <roberto.tore...@gmail.com> wrote:
> Hi Simon,
>
> I'm trying to do the same but I'm quite lost.
>
> How did you do that? (Too direct? :)
>
> Thanks and ciao,
> r-
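For reference, here is roughly what the subclass looks like, specialized to plain text input. Treat it as a simplified, untested sketch rather than my exact code: the HadoopRDD constructor signature varies across Spark versions, and HadoopPartition/getPipeEnvVars are private[spark], which is why this has to live inside the org.apache.spark.rdd package.

    package org.apache.spark.rdd

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
    import org.apache.spark.{InterruptibleIterator, Partition, SparkContext, TaskContext}

    // Text-only variant of HadoopRDD that prepends the value of one pipe env
    // var (e.g. mapreduce_map_input_file) to every record, separated by "|".
    class HadoopRDDWithEnv(
        sc: SparkContext,
        conf: JobConf,
        minPartitions: Int,
        varName: String)
      extends HadoopRDD[LongWritable, Text](
        sc, conf, classOf[TextInputFormat], classOf[LongWritable],
        classOf[Text], minPartitions) {

      override def compute(split: Partition, context: TaskContext)
          : InterruptibleIterator[(LongWritable, Text)] = {
        // HadoopPartition exposes the per-split env vars
        // (map_input_file, mapreduce_map_input_file, ...).
        val envVal = split.asInstanceOf[HadoopPartition]
          .getPipeEnvVars().getOrElse(varName, "")
        val iter = super.compute(split, context).map { case (k, v) =>
          (k, new Text(envVal + "|" + v.toString))
        }
        new InterruptibleIterator(context, iter)
      }
    }

Instead of patching SparkContext itself, the wrapper can be an implicit class (textFileWithEnv is my own name, not a Spark API):

    // Enables: import WithEnv._
    //          val rdd = sc.textFileWithEnv("hdfs path", "mapreduce_map_input_file")
    object WithEnv {
      implicit class SparkContextWithEnv(sc: SparkContext) {
        def textFileWithEnv(path: String, varName: String): RDD[String] = {
          val conf = new JobConf(sc.hadoopConfiguration)
          FileInputFormat.setInputPaths(conf, path)
          new HadoopRDDWithEnv(sc, conf, sc.defaultMinPartitions, varName)
            .map(_._2.toString)
        }
      }
    }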
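As for the pipe() route: from a quick read of PipedRDD, it copies the split's getPipeEnvVars into the child process's environment, so mapreduce_map_input_file should already be visible to awk. My guess is that the single-string form of pipe() gets tokenized on whitespace and mangles the awk program; passing the command as a Seq and letting awk read ENVIRON sidesteps shell quoting entirely. Untested, and it may have nothing to do with the Kryo error:

    // Guess, not verified: pass argv explicitly so nothing is re-tokenized,
    // and read the env var inside awk instead of relying on shell expansion.
    val withFileNames = sc.textFile("hdfs_path")
      .pipe(Seq("/usr/bin/awk", "{ print ENVIRON[\"mapreduce_map_input_file\"], $0 }"))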