Hi

I wanted to ask what the best way is to achieve a per-key auto-increment
number after sorting. For example:

raw file:

1,a,b,c,1,1
1,a,b,d,0,0
1,a,b,e,1,0
2,a,e,c,0,0
2,a,f,d,1,0

Desired output (the last column is the position number after grouping on
the first three fields and reverse sorting on the last two values):

1,a,b,c,1,1,1
1,a,b,d,0,0,3
1,a,b,e,1,0,2
2,a,e,c,0,0,2
2,a,f,d,1,0,1

I am using a solution based on groupByKey, but it is running into some
issues (possibly a bug in PySpark/Spark?). I am wondering if there is a
better way to achieve this.

My solution:

A = (sc.textFile("train.csv")
       .filter(lambda x: not isHeader(x))
       .map(split)
       .map(parse_train)
       .filter(lambda x: x is not None))

B = A.map(lambda k:
    ((k.first_field, k.second_field, k.third_field), k[0:5])).groupByKey()

B.map(sort_n_set_position).flatMap(lambda line: line)

where sort_n_set_position iterates over the values of each group, sorts
them, and appends the position as the last column.
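
For reference, a minimal sketch of what such a function could look like is
below. It assumes each element of B is a (key, rows) pair as produced by
groupByKey, and that every row is a tuple whose last two elements are the
values to sort on. Both are assumptions made for illustration, and the
actual sort_n_set_position may differ.

```python
def sort_n_set_position(pair):
    """Take one (key, rows) pair and return the rows with a 1-based
    position appended as the last column.

    Assumption: each row is a tuple whose last two elements are the
    sort values.
    """
    _key, rows = pair
    # sorted() materializes the iterable; reverse=True gives descending
    # order on (second-to-last value, last value).
    ordered = sorted(rows, key=lambda r: (r[-2], r[-1]), reverse=True)
    # enumerate(..., start=1) yields the per-key auto-increment number.
    return [tuple(r) + (pos,) for pos, r in enumerate(ordered, start=1)]


# Plain-Python check on one group taken from the sample data.
group = (("1", "a"), [("1", "a", "b", "c", 1, 1),
                      ("1", "a", "b", "d", 0, 0),
                      ("1", "a", "b", "e", 1, 0)])
result = sort_n_set_position(group)
```

Since the function returns a list per group, B.flatMap(sort_n_set_position)
should give the same result as the map followed by flatMap shown above.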

Best,
Fahad
