Spark 1.5.0

data:

p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0

spark-shell:

spark-shell \
    --num-executors 2 \
    --driver-memory 1g \
    --executor-memory 10g \
    --executor-cores 8 \
    --master yarn-client


case class Mykey(uname: String, lo: String, f1: Char, f2: Char, f3: Char,
                 f4: Char, f5: Char, f6: String)
case class Myvalue(count1: Long, count2: Long, num: Double)
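
(Note: case classes derive structural equals/hashCode, so two Mykey
instances parsed from identical lines should compare equal and hash the
same. A minimal local check, with values mirroring the first data line:)

val a = Mykey("p1", "lo1", '8', '0', '4', '0', '5', "20150901")
val b = Mykey("p1", "lo1", '8', '0', '4', '0', '5', "20150901")
println((a == b, a.hashCode == b.hashCode))  // prints (true,true)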

val myrdd = sc.textFile("/user/al733a/mydata.txt").map { line =>
  val spl = line.split("\\|", -1)
  val k = spl(0).split(",")
  val v = spl(1).split(",")
  (Mykey(k(0), k(1), k(2)(0), k(3)(0), k(4)(0), k(5)(0), k(6)(0), k(7)),
   Myvalue(v(0).toLong, v(1).toLong, v(2).toDouble))
}
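
As a quick sanity check on the parsing (a sketch; the expected count
assumes the 16-record dataset above, and note that distinct() also
shuffles, so a wrong answer here would point at the same underlying
issue):

myrdd.keys.distinct().count()  // expected: 4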

myrdd.groupByKey().map { case (mykey, val_iterable) => (mykey, 1) }
  .collect().foreach(println)

(Mykey(p1,lo1,8,0,4,0,5,20150901),1)
(Mykey(p1,lo1,8,0,4,0,5,20150901),1)
(Mykey(p1,lo3,8,0,4,0,5,20150901),1)
(Mykey(p1,lo3,8,0,4,0,5,20150901),1)
(Mykey(p1,lo4,8,0,4,0,5,20150901),1)
(Mykey(p1,lo4,8,0,4,0,5,20150901),1)
(Mykey(p1,lo2,8,0,4,0,5,20150901),1)
(Mykey(p1,lo2,8,0,4,0,5,20150901),1)



You can see that each key appears twice in the output, but each distinct
key should appear only once.
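
One way to isolate whether the Mykey case class is the culprit (a
diagnostic sketch, not something I have verified on 1.5.0): group on the
raw key string instead. If this prints 4 groups of size 4 while the
case-class key yields 8 groups, the problem is in how Mykey is hashed or
serialized during the shuffle, not in groupByKey itself:

sc.textFile("/user/al733a/mydata.txt")
  .map(line => (line.split("\\|", -1)(0), 1))
  .groupByKey()
  .map { case (k, vs) => (k, vs.size) }
  .collect()
  .foreach(println)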

Arun

On Mon, Jan 4, 2016 at 4:07 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Can you give a bit more information ?
>
> Release of Spark you're using
> Minimal dataset that shows the problem
>
> Cheers
>
> On Mon, Jan 4, 2016 at 3:55 PM, Arun Luthra <arun.lut...@gmail.com> wrote:
>
>> I tried groupByKey and noticed that it did not group all values into the
>> same group.
>>
>> In my test dataset (a pair RDD) I have 16 records with only 4 distinct
>> keys, so I expected there to be 4 records in the groupByKey result, but
>> instead there were 8. Each of the 4 distinct keys appears 2 times.
>>
>> Is this the expected behavior? I need to be able to get ALL values
>> associated with each key grouped into a SINGLE record. Is it possible?
>>
>> Arun
>>
>> p.s. reduceByKey will not be sufficient for me
>>
>
>
