Two simple suggestions:

1. There is no need to call zipWithIndex twice. Reuse the earlier RDD dt.
2. Replace zipWithIndex with zipWithUniqueId, which does not trigger a Spark job.
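
To illustrate the second point, here is a plain-Scala sketch (no Spark needed) of how I understand the two numbering schemes from the RDD API docs. The object IdSchemes, its two helper functions, and the nested Seq standing in for an RDD's partitions are all made up for illustration; they are not Spark APIs.

```scala
// Plain-Scala model of the two numbering schemes; no Spark required.
// `partitions` is a hypothetical stand-in for the partitions of an RDD.
object IdSchemes {

  // zipWithUniqueId scheme: the i-th item (0-based) of partition k gets
  // id i * n + k, where n is the number of partitions. Each partition can
  // compute its ids on its own, so no extra pass over the data is needed.
  def uniqueIds[T](partitions: Seq[Seq[T]]): Seq[(T, Long)] = {
    val n = partitions.length.toLong
    for {
      (part, k) <- partitions.zipWithIndex
      (item, i) <- part.zipWithIndex
    } yield (item, i * n + k)
  }

  // zipWithIndex scheme: consecutive ids. Each partition needs the sizes of
  // all preceding partitions first, which is the extra counting step.
  def consecutiveIds[T](partitions: Seq[Seq[T]]): Seq[(T, Long)] = {
    val offsets = partitions.map(_.length.toLong).scanLeft(0L)(_ + _)
    for {
      (part, k) <- partitions.zipWithIndex
      (item, i) <- part.zipWithIndex
    } yield (item, offsets(k) + i)
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq("a", "b", "c"), Seq("d"), Seq("e", "f"))
    println(uniqueIds(parts))      // unique ids, with gaps
    println(consecutiveIds(parts)) // ids 0 to 5 in order
  }
}
```

The ids from the first scheme are unique but not consecutive and do not follow line order. That is fine when the id is only used as a join key, as long as both sides of the join are derived from the same keyed RDD.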

Below is your code with the above changes:

var dataRDD = sc.textFile("/test.csv").map(_.split(","))
val dt = dataRDD.zipWithUniqueId.map(_.swap)
val newCol1 = dt.map { case (i, x) => (i, x(1)+x(18)) }
val newCol2 = newCol1.join(dt).map(x=> function(.........))

Hope this helps.

Kiran

On Fri, Jun 5, 2015 at 8:15 AM, Carter <gyz...@hotmail.com> wrote:

> Hi, I have a RDD with MANY columns (e.g., hundreds), and most of my
> operation is on columns, e.g., I need to create many intermediate
> variables from different columns, what is the most efficient way to do
> this?
>
> For example, if my dataRDD[Array[String]] is like below:
>
> 123, 523, 534, ..., 893
> 536, 98, 1623, ..., 98472
> 537, 89, 83640, ..., 9265
> 7297, 98364, 9, ..., 735
> ......
> 29, 94, 956, ..., 758
>
> I will need to create a new column or a variable as newCol1 =
> 2ndCol+19thCol, and another new column based on newCol1 and the existing
> columns: newCol2 = function(newCol1, 34thCol), what is the best way of
> doing this?
>
> I have been thinking using index for the intermediate variables and the
> dataRDD, and then join them together on the index to do my calculation:
>
> var dataRDD = sc.textFile("/test.csv").map(_.split(","))
> val dt = dataRDD.zipWithIndex.map(_.swap)
> val newCol1 = dataRDD.map(x => x(1)+x(18)).zipWithIndex.map(_.swap)
> val newCol2 = newCol1.join(dt).map(x=> function(.........))
>
> Is there a better way of doing this?
>
> Thank you very much!