Re: Column operation on Spark RDDs.

kiran lonikar Mon, 08 Jun 2015 01:39:41 -0700

Two simple suggestions:
1. No need to call zipWithIndex twice. Use the earlier RDD dt.
2. Replace zipWithIndex with zipWithUniqueId, which does not trigger a Spark
job.
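
To illustrate the difference behind suggestion 2, here is a small sketch. It assumes a local SparkContext and a made-up five-element RDD split into two partitions; the object name ZipIdSketch, the helper ids, and the data are illustrative only. zipWithIndex has to know the size of every earlier partition before it can number a row, so on an RDD with more than one partition it runs a job up front. zipWithUniqueId gives the items of partition k the IDs k, n+k, 2n+k, ... (n being the number of partitions), which needs no counting. The resulting IDs are unique but not necessarily consecutive.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ZipIdSketch {
  // Returns the (id, value) pairs produced by both methods for a tiny,
  // made-up RDD with two partitions (illustrative data only).
  def ids(sc: SparkContext): (Array[(Long, String)], Array[(Long, String)]) = {
    val rows = sc.parallelize(Seq("a", "b", "c", "d", "e"), numSlices = 2)

    // Consecutive indices 0..n-1; needs the partition sizes first.
    val byIndex = rows.zipWithIndex().map(_.swap).collect()

    // Unique but possibly non-consecutive IDs; no counting pass needed.
    val byUniqueId = rows.zipWithUniqueId().map(_.swap).collect()

    (byIndex, byUniqueId)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("zip-id-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (byIndex, byUniqueId) = ids(sc)
      println(byIndex.mkString(", "))
      println(byUniqueId.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

Since the ID is only used as a join key here, the gaps in the zipWithUniqueId numbering should not matter.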


Below is your code with the above changes:

var dataRDD = sc.textFile("/test.csv").map(_.split(","))
val dt = dataRDD.zipWithUniqueId.map(_.swap)
val newCol1 = dt.map {case (i, x) => (i, x(1)+x(18)) }
val newCol2 = newCol1.join(dt).map(x=> function(.........))
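
For anyone who wants to try this end to end, a self-contained sketch of the same pipeline follows. It makes several assumptions that are not in the thread: the rows come from an in-memory Seq instead of /test.csv, the columns are numeric (so they are parsed with toDouble, since + on the raw Strings would concatenate them), and combine is a hypothetical stand-in for the elided function(...). The names ColumnOpsSketch and newCol2 are likewise only illustrative. Implicit pair-RDD conversions as in Spark 1.3 or later are assumed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ColumnOpsSketch {
  // Hypothetical stand-in for the elided function(...) in the thread.
  def combine(newCol1: Double, col34: Double): Double = newCol1 * col34

  // Builds newCol2 keyed by row ID from comma-separated lines.
  def newCol2(sc: SparkContext, lines: Seq[String]): Array[(Long, Double)] = {
    val dataRDD = sc.parallelize(lines).map(_.split(","))

    // Key every row once and reuse dt for all derived columns.
    val dt = dataRDD.zipWithUniqueId().map(_.swap)

    // newCol1 = 2ndCol + 19thCol (zero-based indices 1 and 18).
    // Assumption: columns are numeric, so parse before adding.
    val newCol1 = dt.map { case (i, x) =>
      (i, x(1).trim.toDouble + x(18).trim.toDouble)
    }

    // The join yields (id, (newCol1 value, original row)).
    newCol1.join(dt)
      .map { case (i, (c1, x)) => (i, combine(c1, x(33).trim.toDouble)) }
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("column-ops-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Made-up data: 3 rows of 40 columns, cell value = row * 100 + column.
      val lines = Seq.tabulate(3)(r => Seq.tabulate(40)(c => r * 100 + c).mkString(","))
      newCol2(sc, lines).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```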

Hope this helps.
Kiran


On Fri, Jun 5, 2015 at 8:15 AM, Carter <gyz...@hotmail.com> wrote:

> Hi, I have an RDD with MANY columns (e.g., hundreds), and most of my
> operations are on columns, e.g., I need to create many intermediate
> variables from different columns. What is the most efficient way to do this?
>
> For example, if my dataRDD[Array[String]] is like below:
>
>     123, 523, 534, ..., 893
>     536, 98, 1623, ..., 98472
>     537, 89, 83640, ..., 9265
>     7297, 98364, 9, ..., 735
>     ......
>     29, 94, 956, ..., 758
>
> I will need to create a new column or variable as newCol1 =
> 2ndCol+19thCol, and another new column based on newCol1 and the existing
> columns: newCol2 = function(newCol1, 34thCol). What is the best way of
> doing this?
>
> I have been thinking of using an index for the intermediate variables and
> the dataRDD, and then joining them on the index to do my calculation:
> var dataRDD = sc.textFile("/test.csv").map(_.split(","))
> val dt = dataRDD.zipWithIndex.map(_.swap)
> val newCol1 = dataRDD.map(x => x(1)+x(18)).zipWithIndex.map(_.swap)
> val newCol2 = newCol1.join(dt).map(x=> function(.........))
>
> Is there a better way of doing this?
>
> Thank you very much!
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Column-operation-on-Spark-RDDs-tp23165.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
