I think it's clear if you format your function reasonably:

    mjpJobOrderRDD.map(line => {
      val tokens = line.split("\t")
      if (tokens.length == 164 && tokens(23) != null) {
        (tokens(23), tokens(7))
      }
    })

When the condition is false, the if has no else branch, so it evaluates to
Unit; when it is true, it evaluates to a (String, String) tuple. The closest
common supertype of Unit and (String, String) is Any, which is why the
compiler infers RDD[Any].
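You can see the same widening in the plain Scala REPL, with no Spark
involved (output from a 2.11-era REPL; formatting may differ slightly):

    scala> val x = if (true) ("a", "b")    // no else branch, so Unit is a
    x: Any = (a,b)                         // possible result type

    scala> val y = if (false) ("a", "b")
    y: Any = ()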
If you just mean to output a result in some cases and not others, use
flatMap and return an Option: Some(...) when you want to emit a pair, None
otherwise:

    mjpJobOrderRDD.flatMap { line =>
      val tokens = line.split("\t")
      if (tokens.length == 164 && tokens(23) != null) {
        Some((tokens(23), tokens(7)))
      } else {
        None
      }
    }

On Wed, Jan 28, 2015 at 7:37 PM, Sanjay Subramanian
<sanjaysubraman...@yahoo.com.invalid> wrote:

> hey guys
>
> I am not following why this happens
>
> DATASET
> =======
> Tab-separated values (164 columns)
>
> Spark command 1
> ===============
> val mjpJobOrderRDD = sc.textFile("/data/cdr/cdr_mjp_joborder_raw")
> val mjpJobOrderColsPairedRDD = mjpJobOrderRDD.map(line => { val tokens =
>   line.split("\t"); (tokens(23), tokens(7)) })
> mjpJobOrderColsPairedRDD: org.apache.spark.rdd.RDD[(String, String)] =
>   MappedRDD[18] at map at <console>:14
>
> Spark command 2
> ===============
> val mjpJobOrderRDD = sc.textFile("/data/cdr/cdr_mjp_joborder_raw")
> scala> val mjpJobOrderColsPairedRDD = mjpJobOrderRDD.map(line => { val
>   tokens = line.split("\t"); if (tokens.length == 164 && tokens(23) != null)
>   { (tokens(23), tokens(7)) } })
> mjpJobOrderColsPairedRDD: org.apache.spark.rdd.RDD[Any] = MappedRDD[19] at
>   map at <console>:14
>
> In the second case above, why does it say org.apache.spark.rdd.RDD[Any]
> and not org.apache.spark.rdd.RDD[(String, String)]?
>
> thanks
>
> sanjay
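As an aside: the same filter-and-project can also be written with the
RDD.collect overload that takes a PartialFunction (not the zero-argument
collect() action that pulls results to the driver). A sketch, splitting
each line once in a map first:

    mjpJobOrderRDD
      .map(_.split("\t"))           // tokenize once per line
      .collect {                    // keep only rows the pattern matches
        case tokens if tokens.length == 164 && tokens(23) != null =>
          (tokens(23), tokens(7))   // yields RDD[(String, String)]
      }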