Hi, I wanted to ask what the best way is to get per-key auto-increment numbering after sorting, e.g.:
raw file:

1,a,b,c,1,1
1,a,b,d,0,0
1,a,b,e,1,0
2,a,e,c,0,0
2,a,f,d,1,0

post-output (the last column is the position number after grouping on the first three fields and reverse sorting on the last two values):

1,a,b,c,1,1,1
1,a,b,d,0,0,3
1,a,b,e,1,0,2
2,a,e,c,0,0,2
2,a,f,d,1,0,1

I am using a solution based on groupByKey, but it is running into some issues (possibly a bug in PySpark/Spark?), so I am wondering if there is a better way to achieve this.

My solution:

A = sc.textFile("train.csv").filter(lambda x: not isHeader(x)).map(split).map(parse_train).filter(lambda x: x is not None)
B = A.map(lambda k: ((k.first_field, k.second_field, k.third_field), k[0:5])).groupByKey()
B.map(sort_n_set_position).flatMap(lambda line: line)

where sort_n_set_position iterates over the iterator for each key, sorts the rows, and adds the last column.

Best,
fahad
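For reference, here is a minimal sketch of the kind of per-group function described above. It is not the original sort_n_set_position, whose body is not shown. It assumes each group arrives as a (key, rows) pair, as groupByKey produces, and that the last two items of each row are the sort values. The sample group is the first key from the example, so it can be checked locally without Spark.

```python
def sort_n_set_position(group):
    """Hypothetical helper: rank the rows of one (key, rows) group.

    Assumes every row is a sequence whose last two items are the
    values to sort on, in descending order.
    """
    key, rows = group
    # Sort descending on the last two values of each row.
    ordered = sorted(rows, key=lambda r: (r[-2], r[-1]), reverse=True)
    # Append a 1-based position within the group to every row.
    return [tuple(r) + (pos,) for pos, r in enumerate(ordered, start=1)]


# Local check with the first group from the example data.
group = (("1", "a", "b"), [
    ("1", "a", "b", "c", 1, 1),
    ("1", "a", "b", "d", 0, 0),
    ("1", "a", "b", "e", 1, 0),
])
ranked = sort_n_set_position(group)
for row in ranked:
    print(row)
```

Because a function of this shape returns a list of rows per group, it could be passed straight to flatMap on the grouped RDD, which would make the separate map followed by flatMap(lambda line: line) unnecessary.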