Here's a way to debug something like this:

scala> d5.keyBy(_.split(" ")(0)).reduceByKey((v1, v2) => {
         println("v1: " + v1)
         println("v2: " + v2)
         (v1.split(" ")(1).toInt + v2.split(" ")(1).toInt).toString
       }).collect
You get:

v1: 1 2 3 4 5
v2: 1 2 3 4 5
v1: 4
v2: 1 2 3 4 5
java.lang.ArrayIndexOutOfBoundsException: 1

reduceByKey() works much like regular Scala reduce(): it calls the function on the first two values, then on the result of that and the next value, then on the result of that and the next value, and so on. All three of your lines share the key "1", so the function first gets the first two lines, adds 2 + 2, and returns the string "4". It is then called with v1 = "4" and v2 = the third line, and "4".split(" ")(1) throws the ArrayIndexOutOfBoundsException, because splitting "4" yields only one element.

What you could do instead is extract the numeric value once with mapValues, so the reduce function only ever sees Ints (a quick check of the expected result is sketched after the quoted message below):

scala> d5.keyBy(_.split(" ")(0)).mapValues(_.split(" ")(1).toInt).reduceByKey((v1, v2) => v1 + v2).collect

On Thu, Apr 17, 2014 at 6:29 PM, 诺铁 <noty...@gmail.com> wrote:
> Hi,
>
> I am new to Spark. When I tried to write some simple tests in the Spark shell, I
> ran into the following problem.
>
> I created a very small text file named 5.txt:
> 1 2 3 4 5
> 1 2 3 4 5
> 1 2 3 4 5
>
> and experimented in the Spark shell:
>
> scala> val d5 = sc.textFile("5.txt").cache()
> d5: org.apache.spark.rdd.RDD[String] = MappedRDD[91] at textFile at
> <console>:12
>
> scala> d5.keyBy(_.split(" ")(0)).reduceByKey((v1,v2) => (v1.split("
> ")(1).toInt + v2.split(" ")(1).toInt).toString).first
>
> Then this error occurs:
> 14/04/18 00:20:11 ERROR Executor: Exception in task ID 36
> java.lang.ArrayIndexOutOfBoundsException: 1
>         at $line60.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:15)
>         at $line60.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:15)
>         at
> org.apache.spark.util.collection.ExternalAppendOnlyMap$$anonfun$2.apply(ExternalAppendOnlyMap.scala:120)
>
> When I delete one line from the file, making it two lines, the result is
> correct. I don't understand what the problem is. Please help me, thanks.
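For completeness, here is a rough sketch of what the mapValues version should return against the three-line 5.txt above (I haven't run it against your exact setup). _ + _ is just shorthand for (v1, v2) => v1 + v2, and all three lines share the key "1" and contribute the value 2, so they collapse to a single pair:

scala> d5.keyBy(_.split(" ")(0)).mapValues(_.split(" ")(1).toInt).reduceByKey(_ + _).collect
res0: Array[(String, Int)] = Array((1,6))

If you really need the sums back as Strings, adding a mapValues(_.toString) at the end should do it.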