My thought would be to key by the first item in each array, then take just
one array for each key. Something like the below:
val v = sc.parallelize(Seq(Seq(1,2,3,4), Seq(1,5,2,3), Seq(2,3,4,5)))
val col = 0
val output = v.keyBy(_(col)).reduceByKey((a, b) => a).values
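
If it helps to see the idea without a cluster, here is a rough sketch of the same key-then-keep-one approach on plain Scala collections. The object and method names (DedupByColumn, dedupByColumn) are just ones I made up for illustration, and it assumes every row has at least col + 1 elements. One caveat for the Spark version: I wouldn't rely on which of the duplicate rows survives, since reduceByKey makes no promise about the order in which values are combined.

```scala
// Local-collection sketch (no Spark): keep one row per distinct value in column `col`.
// DedupByColumn / dedupByColumn are hypothetical names used only for this example.
object DedupByColumn {
  def dedupByColumn(rows: Seq[Seq[Int]], col: Int): Seq[Seq[Int]] =
    rows
      .groupBy(_(col))   // key each row by the value in the chosen column
      .values            // drop the keys, keep the grouped rows
      .map(_.head)       // keep the first row seen for each key
      .toSeq

  def main(args: Array[String]): Unit = {
    val d = Seq(Seq(1, 2, 3, 4), Seq(1, 5, 2, 3), Seq(2, 3, 4, 5))
    // Two rows come back, one per distinct value in column 0;
    // the order of the result is not guaranteed.
    println(dedupByColumn(d, 0))
  }
}
```
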
On Tue, Mar 25, 2014 at 1:21 AM, Chengi Liu wrote:
Hi,
I have a very simple use case:
I have an RDD as follows:
d = [[1,2,3,4],[1,5,2,3],[2,3,4,5]]
Now, I want to remove all the duplicates from a column and return the
remaining rows.
For example:
If I want to remove duplicates based on column 1,
then basically I would remove either row