Hello, I have started experimenting with a Spark cluster. I have a parallelizable job where I want to walk through several folders, each of which contains multiple files; I parse each file, do some processing on its records, and write the whole file back out to an output file. I apply the same processing operation (hashing certain fields in the data file) to every file inside the folder. Simply:
For a directory D:
* Read all files inside D.
* For each file F:
  * For each line L in F, do some processing and write the result to an output file.

So if there are 200 files inside the input directory, I would like to have 200 files in my output directory. I learnt that with the saveAsTextFile(name) API, Spark creates a directory with the name we specify and writes the actual output files inside that folder as part-00000, part-00001, etc. (similar to Hadoop, I assumed). My question: is there a way to specify the name of the output directory and redirect all my saveAsTextFile(dirName) outputs into a single folder instead? Let me know if there is a way of achieving this. If not, I would appreciate hearing some workarounds. Thanks! Regards, Ramkumar Chokkalingam, Masters Student, University of Washington || 206-747-3515 <http://www.linkedin.com/in/mynameisram>
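One possible workaround (a sketch of my own, not something confirmed on this list): call saveAsTextFile once per input file, each into its own temporary directory, and then, outside Spark, gather the part-* files from those directories into a single destination folder, naming each output after its source file. The helper name collect_spark_outputs and the directory layout below are hypothetical; this sketch uses the local filesystem via os/shutil, whereas on HDFS one would do the equivalent moves with the Hadoop FileSystem API.

```python
import os
import shutil
import tempfile

def collect_spark_outputs(output_dirs, dest_dir):
    """Merge the part-* files from each saveAsTextFile output directory
    into a single folder, one file per source directory."""
    os.makedirs(dest_dir, exist_ok=True)
    for out_dir in output_dirs:
        base = os.path.basename(out_dir.rstrip("/"))
        parts = sorted(f for f in os.listdir(out_dir) if f.startswith("part-"))
        # Concatenate the partitions so each input file yields one output file.
        with open(os.path.join(dest_dir, base + ".txt"), "wb") as dst:
            for p in parts:
                with open(os.path.join(out_dir, p), "rb") as src:
                    shutil.copyfileobj(src, dst)

# Simulate two saveAsTextFile output directories, then merge them.
root = tempfile.mkdtemp()
for name, lines in [("fileA", ["a1", "a2"]), ("fileB", ["b1"])]:
    d = os.path.join(root, name)
    os.makedirs(d)
    with open(os.path.join(d, "part-00000"), "w") as f:
        f.write("\n".join(lines) + "\n")

dest = os.path.join(root, "merged")
collect_spark_outputs([os.path.join(root, n) for n in ("fileA", "fileB")], dest)
print(sorted(os.listdir(dest)))  # ['fileA.txt', 'fileB.txt']
```

With 200 input files this leaves 200 named files in one folder, at the cost of 200 separate Spark jobs; another direction people sometimes take is saveAsHadoopFile with a custom MultipleTextOutputFormat to control output file names directly, but that requires more Hadoop-API plumbing.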
