I can confirm this bug. The behavior for groupByKey is the same as
reduceByKey - your example is actually grouping on just the name. Try this:

sc.parallelize(ps).map(x=> (x,1)).groupByKey().collect
res1: Array[(P, Iterable[Int])] = Array((P(bob),ArrayBuffer(1)),
(P(bob),ArrayBuffer(1)), (P(alice),ArrayBuffer(1)),
(P(charly),ArrayBuffer(1)))


On Tue, Jul 22, 2014 at 10:30 AM, Gerard Maas <gerard.m...@gmail.com> wrote:

> Just to narrow down the issue, it looks like the issue is in 'reduceByKey'
> and derivates like 'distinct'.
>
> groupByKey() seems to work
>
> sc.parallelize(ps).map(x=> (x.name,1)).groupByKey().collect
> res: Array[(String, Iterable[Int])] = Array((charly,ArrayBuffer(1)),
> (abe,ArrayBuffer(1)), (bob,ArrayBuffer(1, 1)))
>
>
>
> On Tue, Jul 22, 2014 at 4:20 PM, Gerard Maas <gerard.m...@gmail.com>
> wrote:
>
>> Using a case class as a key doesn't seem to work properly. [Spark 1.0.0]
>>
>> A minimal example:
>>
>> case class P(name:String)
>> val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
>> sc.parallelize(ps).map(x=> (x,1)).reduceByKey((x,y) => x+y).collect
>> [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1),
>> (P(bob),1), (P(abe),1), (P(charly),1))
>>
>> In contrast to the expected behavior, that should be equivalent to:
>> sc.parallelize(ps).map(x=> (x.name,1)).reduceByKey((x,y) => x+y).collect
>> Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
>>
>> Any ideas why this doesn't work?
>>
>> -kr, Gerard.
>>
>
>


-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io

Reply via email to