Re: RDD API question

Sean Owen Fri, 14 Feb 2014 09:25:44 -0800

zipWithIndex is a Scala collection method, and is also implemented on
RDDs. You can use map to transform what you have into what you want --
effectively "selecting" out the things you need.
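
To make the steps concrete, here is a rough sketch of the logic using
plain Java collections rather than the Spark API, so it runs without a
cluster. The class and method names are made up for illustration, and
comparing IDs with "<" instead of merely "not equal" is an assumption,
chosen so that each unordered pair comes out once instead of twice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Spark.
class PairsWithinKey {

    // Each input row is a {key, value} array, e.g. {"K1", "A"}.
    static List<String> pairs(List<String[]> rows) {
        List<String> out = new ArrayList<>();
        // The list position plays the role of the row ID that
        // zipWithIndex would attach to each element.
        for (int i = 0; i < rows.size(); i++) {
            for (int j = 0; j < rows.size(); j++) {
                boolean sameKey = rows.get(i)[0].equals(rows.get(j)[0]); // the self-join on key
                boolean keep = i < j; // assumption: drops A A and the mirrored B A
                if (sameKey && keep) {
                    // The "map" step: emit only the values, dropping the IDs.
                    out.add(rows.get(i)[1] + " " + rows.get(j)[1]);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"K1", "A"}, new String[] {"K1", "B"}, new String[] {"K1", "C"},
            new String[] {"K2", "D"}, new String[] {"K2", "D"}, new String[] {"K2", "E"});
        for (String p : pairs(rows)) {
            System.out.println(p);
        }
    }
}
```

Note that on the sample data D E comes out twice, since each of the
two D rows pairs with E. If only one copy is wanted, a distinct applied
after this filter would collapse it while still keeping D D.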


As Nathan notes, this literal join approach might not be the fastest
thing, but it should work.
--
Sean Owen | Director, Data Science | London


On Fri, Feb 14, 2014 at 4:47 PM, Sonal Goyal <sonalgoy...@gmail.com> wrote:
> Thanks Sean. Is zipWithIndex available in the Java API? Also, how do I
> remove the generated id from further processing?
>
> Best Regards,
> Sonal
> Nube Technologies
>
> On Fri, Feb 14, 2014 at 9:14 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> You could do a zipWithIndex to add a sort of "row ID" to each element
>> of the input RDD. Then after self-joining, exclude elements whose row
>> ID is the same.
>> --
>> Sean Owen | Director, Data Science | London
>>
>>
>> On Fri, Feb 14, 2014 at 3:42 PM, Sonal Goyal <sonalgoy...@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > I have some PairRDDs like
>> >
>> > K1 A
>> > K1 B
>> > K1 C
>> >
>> > K2 D
>> > K2 D
>> > K2 E
>> >
>> > and I want to create
>> >
>> > A B
>> > A C
>> > B C
>> > D D
>> > D E
>> >
>> > What's the best way to do this? If I join the RDD with itself, I will end
>> > up with A A, which I do not want. I can't do distinct, as that will filter
>> > out the D D, which I want.
>> >
>> > Any pointers? Thanks.
>> >
>> > Best Regards,
>> > Sonal
>> > Nube Technologies
>> >
