Re: filter operation in pyspark

2014-03-03 Thread Mayur Rustagi
Could be a number of issues. Maybe your CSV is not allowing the map tasks to be split, or the file is not process-node local. How many tasks are you seeing in the Spark web UI for the map and store of the data? Are all the nodes being used when you look at the task level? Is the time taken by each task roughly equal?
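A minimal sketch of how one might check those diagnostics from PySpark itself, assuming a PySpark version where RDD.getNumPartitions() is available and using a placeholder file path:

    from pyspark import SparkContext

    sc = SparkContext(appName="filter-diagnostics")

    # Hypothetical path; replace with the actual CSV location.
    rdd = sc.textFile("hdfs:///data/example.csv")

    # Number of partitions, and hence map tasks, Spark created for the file.
    print(rdd.getNumPartitions())

    # If the file produced too few partitions, spread the work over more tasks
    # so all nodes get used.
    rdd = rdd.repartition(64)

The partition count here should roughly match the task count shown in the web UI; a single large partition would explain one long-running task on one node.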

filter operation in pyspark

2014-03-03 Thread Mohit Singh
Hi, I have a CSV file (say "n" columns). I am trying to do a filter operation like:

    query = rdd.filter(lambda x: x[1] == "1234")
    query.take(20)

Basically, this should return the rows with that specific value? This manipulation is taking quite some time to execute (if I can compare, maybe slo
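A self-contained sketch of the operation described above, assuming the CSV is plain comma-separated text read with sc.textFile and split per line (the path and the column index are placeholders):

    from pyspark import SparkContext

    sc = SparkContext(appName="csv-filter")

    # Hypothetical path; each line is one comma-separated row.
    lines = sc.textFile("hdfs:///data/example.csv")

    # Split each line into its columns so x[1] refers to the second column.
    rdd = lines.map(lambda line: line.split(","))

    # Keep only rows whose second column equals "1234", then pull 20 of them
    # back to the driver. filter() is lazy; take(20) is what triggers the job.
    query = rdd.filter(lambda x: x[1] == "1234")
    print(query.take(20))

Note that take(20) launches the whole job, so the time observed includes reading and splitting the file, not just the filter itself.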