Hi Cheng,

Thank you for letting me know. So what do you think is a better way to
debug?


On Fri, Apr 18, 2014 at 9:27 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

> A tip: using println is only convenient when you are working in local
> mode. When running Spark in cluster mode (standalone/YARN/Mesos), the
> output of println goes to the executors' stdout, not to your driver
> console.
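>
> A minimal sketch of one alternative (just an assumption about what might
> help, not the only option): pull a small sample back to the driver with
> take() and print it there, so the debug output shows up in your shell
> rather than in the executor logs.
>
> // take() returns an Array on the driver, so println runs locally
> scala> d5.keyBy(_.split(" ")(0)).take(3).foreach(println)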
>
>
> On Fri, Apr 18, 2014 at 6:53 AM, 诺铁 <noty...@gmail.com> wrote:
>
>> Yeah, I got it!
>> Using println to debug is great for me while exploring Spark.
>> Thank you very much for your kind help.
>>
>>
>>
>> On Fri, Apr 18, 2014 at 12:54 AM, Daniel Darabos <
>> daniel.dara...@lynxanalytics.com> wrote:
>>
>>> Here's a way to debug something like this:
>>>
>>> scala> d5.keyBy(_.split(" ")(0)).reduceByKey((v1,v2) => {
>>>            println("v1: " + v1)
>>>            println("v2: " + v2)
>>>            (v1.split(" ")(1).toInt + v2.split(" ")(1).toInt).toString
>>>        }).collect
>>>
>>> You get:
>>> v1: 1 2 3 4 5
>>> v2: 1 2 3 4 5
>>> v1: 4
>>> v2: 1 2 3 4 5
>>> java.lang.ArrayIndexOutOfBoundsException: 1
>>>
>>> reduceByKey() works much like regular Scala reduce(): it calls the
>>> function on the first two values, then on the result of that and the
>>> next value, and so on. So first you add 2 + 2 and get "4". Then your
>>> function is called with v1 = "4" and v2 set to the third line, and
>>> "4".split(" ") has no element at index 1, hence the exception.
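>>>
>>> Here is a rough plain-Scala sketch (no Spark, just a List, to make the
>>> folding order visible) of what happens with your function for key "1";
>>> the three strings stand in for the three lines of your file:
>>>
>>> scala> List("1 2 3 4 5", "1 2 3 4 5", "1 2 3 4 5").reduce { (v1, v2) =>
>>>            // 1st call: both are full lines, the result is "4"
>>>            // 2nd call: v1 is "4", and "4".split(" ")(1) throws
>>>            // java.lang.ArrayIndexOutOfBoundsException: 1
>>>            (v1.split(" ")(1).toInt + v2.split(" ")(1).toInt).toString
>>>        }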
>>>
>>> What you could do instead:
>>>
>>> scala> d5.keyBy(_.split(" ")(0)).mapValues(_.split(" ")(1).toInt).reduceByKey((v1, v2) => v1 + v2).collect
>>>
>>>
>>> On Thu, Apr 17, 2014 at 6:29 PM, 诺铁 <noty...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am new to Spark. While trying to write some simple tests in the Spark
>>>> shell, I ran into the following problem.
>>>>
>>>> I created a very small text file named 5.txt:
>>>> 1 2 3 4 5
>>>> 1 2 3 4 5
>>>> 1 2 3 4 5
>>>>
>>>> and experimented in the Spark shell:
>>>>
>>>> scala> val d5 = sc.textFile("5.txt").cache()
>>>> d5: org.apache.spark.rdd.RDD[String] = MappedRDD[91] at textFile at
>>>> <console>:12
>>>>
>>>> scala> d5.keyBy(_.split(" ")(0)).reduceByKey((v1, v2) => (v1.split(" ")(1).toInt + v2.split(" ")(1).toInt).toString).first
>>>>
>>>> Then this error occurs:
>>>> 14/04/18 00:20:11 ERROR Executor: Exception in task ID 36
>>>> java.lang.ArrayIndexOutOfBoundsException: 1
>>>> at $line60.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:15)
>>>>  at $line60.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:15)
>>>> at
>>>> org.apache.spark.util.collection.ExternalAppendOnlyMap$$anonfun$2.apply(ExternalAppendOnlyMap.scala:120)
>>>>
>>>> When I delete one line from the file, leaving two lines, the result is
>>>> correct. I don't understand what the problem is. Please help me, thanks.
>>>>
>>>>
>>>
>>
>
