My thought would be to key each array by the item in the chosen column, then take just one array for each key. Something like the following:
val v = sc.parallelize(Seq(Seq(1,2,3,4), Seq(1,5,2,3), Seq(2,3,4,5)))
val col = 0
val output = v.keyBy(_(col)).reduceByKey((a, b) => a).values

On Tue, Mar 25, 2014 at 1:21 AM, Chengi Liu <chengi.liu...@gmail.com> wrote:
> Hi,
>   I have a very simple use case:
>
> I have an RDD as follows:
>
> d = [[1,2,3,4],[1,5,2,3],[2,3,4,5]]
>
> Now, I want to remove all the duplicates from a column and return the
> remaining rows.
> For example, if I want to remove duplicates based on column 1, then I
> would remove either row 1 or row 2 in my final result, because column 1
> of both the first and second rows is the same element (1), and hence a
> duplicate.
> So a possible result is:
>
> output = [[1,2,3,4],[2,3,4,5]]
>
> How do I do this in Spark?
> Thanks
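For anyone without a SparkContext handy, the same dedup-by-column idea can be sketched with plain Scala collections (this is just an illustration of the semantics, not the distributed version):

```scala
// Sample data and column index, matching the example above.
val d = Seq(Seq(1, 2, 3, 4), Seq(1, 5, 2, 3), Seq(2, 3, 4, 5))
val col = 0

// Group rows by the value in column `col`, then keep one representative
// row per key -- mirroring keyBy(_(col)).reduceByKey((a, b) => a).values.
// Seq.groupBy keeps elements of each group in their original order, so
// `head` here is the first occurrence; the order of keys is unspecified.
val output = d
  .groupBy(_(col))   // Map[Int, Seq[Seq[Int]]]
  .values
  .map(_.head)       // one row per distinct column value
  .toSeq

println(output)
```

Note that in the real RDD version, `reduceByKey((a, b) => a)` keeps *some* row per key, but which one depends on partitioning and merge order, so you should not rely on it being the first row in input order.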