Hello,

I have started experimenting with a Spark cluster. I have a parallel job
where I want to walk through several folders, each of which contains
multiple files. For each file, I parse its records, do some processing on
them, and write the whole file back out to an output file. The processing
operation (hashing certain fields in the data file) is the same for every
file inside the folder. Simply put:

*For a directory D,*
*  Read all files inside D.*
*    For each file F:*
*      For each line L in F: do some processing and write the*
*      processing output to a file.*
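For concreteness, here is a rough sequential (non-Spark) sketch of the loop above. The delimiter, the choice of SHA-256, and the helper names are just placeholder assumptions, not my actual job:

```python
import hashlib
import os

def hash_field(value):
    # Hash a single field value (SHA-256 hex digest, as an example).
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def process_file(in_path, out_path, field_indices, sep=","):
    # Read every line, hash the chosen fields, write the whole file back out.
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            fields = line.rstrip("\n").split(sep)
            for i in field_indices:
                if i < len(fields):
                    fields[i] = hash_field(fields[i])
            fout.write(sep.join(fields) + "\n")

def process_directory(in_dir, out_dir, field_indices):
    # One output file per input file: 200 inputs -> 200 outputs.
    os.makedirs(out_dir, exist_ok=True)
    for name in os.listdir(in_dir):
        process_file(os.path.join(in_dir, name),
                     os.path.join(out_dir, name), field_indices)
```

This is what I would like the Spark version of the job to reproduce: a flat output directory mirroring the input, one file per input file.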

So if there are 200 files inside the input directory, I would like to have
200 files in my output directory. I learnt that with the
*saveAsTextFile(name)* API, Spark creates a directory with the name we
specify and writes the actual output files inside that directory as
part-00000, part-00001, etc. (similar to Hadoop, I assume).
My question: is there a way to specify the name of the output directory and
*redirect all my saveAsTextFile(dirName) outputs into a single folder*
instead?

Let me know if there is a way of achieving this. If not, I would appreciate
hearing about some workarounds. Thanks!


Regards,

Ramkumar Chokkalingam,
Masters Student, University of Washington || 206-747-3515
 <http://www.linkedin.com/in/mynameisram>
