Hi folks,

I am writing to ask how to filter and partition a set of files with Spark. The situation is that I have N large files (too big to fit on a single machine). Each line starts with a category (say Sport, Food, etc.), and there are fewer than 100 categories in total. I need a program that scans the file set, groups the lines by category, and saves them separately into different folders, one per category. For instance, the program should generate a Sport folder containing all lines whose category is Sport. I would also rather not put everything for a category into a single file, since that file might be too big.

Any ideas on how to implement this logic efficiently with Spark? I believe groupBy is not acceptable, since even the data belonging to a single category is too big to fit on a single machine.

Regards,
Yunsima
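For concreteness, here is a rough sketch of the kind of job I have in mind, using PySpark's `DataFrameWriter.partitionBy` (the input/output paths, the comma delimiter, and the `maxRecordsPerFile` value are just placeholder assumptions on my side):

```python
# Sketch of one possible approach (not tested at scale): parse the category
# from each line, then let DataFrameWriter.partitionBy write one
# sub-directory per category. The delimiter and paths are placeholders.

def split_category(line, sep=","):
    """Split a raw line into (category, payload) on the first separator."""
    category, _, payload = line.partition(sep)
    return category, payload

def write_partitioned(input_path, output_path):
    # Import Spark here so the parsing helper above stays usable without it.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("split-by-category").getOrCreate()
    df = (spark.read.text(input_path)        # one string column named "value"
               .rdd
               .map(lambda row: split_category(row.value))
               .toDF(["category", "payload"]))
    (df.write
       .partitionBy("category")                 # one folder per category, e.g. category=Sport/
       .option("maxRecordsPerFile", 1_000_000)  # cap records per file instead of one huge file
       .mode("overwrite")
       .text(output_path))                      # "payload" is the single remaining text column
```

Calling something like `write_partitioned("hdfs:///data/in/*", "hdfs:///data/out")` should then produce directories such as `category=Sport/` under the output path, each holding many part files rather than one giant file, and without ever collecting a whole category onto one machine.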