Hi, I come from a map/reduce background and am trying to do something quite trivial:
I have a lot of files on HDFS in this format (one record per line):

    1,2,3
    2,3,5
    1,3,5
    2,3,4
    2,5,1

I need to group by key (the first column):

    key    values
    1  --> (2,3),(3,5)
    2  --> (3,5),(3,4),(5,1)

and then pass the grouped values to my custom function f(). The outcome of f() should be written to HDFS together with the corresponding key:

    1 --> outcome of f()
    2 --> outcome of f()

My code is:

    def doSplit(x):
        y = x.split(',')
        if len(y) == 3:
            return y[0], (y[1], y[2])

    lines = sc.textFile(filename, 1)
    counts = lines.map(doSplit).groupByKey()
    output = counts.collect()

    for (key, value) in output:
        print 'build model for key ->', key
        print value
        f(str(key), value)

Questions:

1) lines.map(doSplit).groupByKey() - I didn't find an option like groupByKey(f()) to process the grouped values. How can I process the grouped values with a custom function? f() has some non-trivial logic.

2) Collecting the result with collect() and passing it to the function (I really don't like this approach) does not look scalable and runs only on one machine. What is the PySpark way to process the grouped keys in a distributed fashion, in parallel across the different machines of the cluster?

3) If the output is processed this way, how can the data be stored on HDFS?

Thanks,
Oleg.
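P.S. Below is roughly what I imagine the distributed version could look like, using mapValues() and saveAsTextFile(); the body of f() and the output path outdir are just placeholders, and I am not sure this is the right approach:

    # assumption: f receives the iterable of value pairs for one key and
    # returns something that can be converted to a string
    def f(values):
        # non-trivial model-building logic would go here; placeholder for now
        return len(list(values))

    lines = sc.textFile(filename)
    counts = lines.map(doSplit).groupByKey()

    # apply f to each group on the workers instead of collecting to the driver
    results = counts.mapValues(f)

    # write "key<TAB>outcome" lines back to HDFS (outdir is a placeholder path)
    results.map(lambda kv: '%s\t%s' % (kv[0], kv[1])).saveAsTextFile(outdir)

Is this the right way to do it?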