Hey, sorry, the answer to this question is similar to the previous one. 
You'll have to move the files from the output directories into a common 
directory by hand, possibly renaming them. The Hadoop InputFormat and 
OutputFormat APIs that we use are just designed to work at the level of 
directories (one directory represents one dataset).
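
If it helps, here's a rough sketch of doing that move with the Hadoop FileSystem API (the directory names and the collision-avoiding prefix are just placeholders for illustration):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  val fs = FileSystem.get(new Configuration())
  val outputDirs = Seq("out/dirA", "out/dirB")   // hypothetical per-job output directories
  val merged = new Path("out/merged")
  fs.mkdirs(merged)
  for (dir <- outputDirs;
       part <- fs.listStatus(new Path(dir)) if part.getPath.getName.startsWith("part-")) {
    // prefix each part file with its source directory name so files from different jobs don't collide
    fs.rename(part.getPath, new Path(merged, new Path(dir).getName + "-" + part.getPath.getName))
  }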

One other option would be to build a union of multiple RDDs using 
SparkContext.union(rdd1, rdd2, etc.), and then call saveAsTextFile on that. 
That way they'll all be written to the same output location.
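
As a rough sketch (assuming sc is your SparkContext and processLine stands in for whatever per-line transformation you're doing):

  val dirs = Seq("hdfs:///input/dirA", "hdfs:///input/dirB")     // your input directories
  val rdds = dirs.map(d => sc.textFile(d).map(processLine))      // one RDD per directory
  sc.union(rdds).saveAsTextFile("hdfs:///output/combined")       // single output directory for everything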

Matei

On Oct 6, 2013, at 8:51 PM, Ramkumar Chokkalingam <ramkumar...@gmail.com> wrote:

> 
> Hello, 
> 
> I have started experimenting with a Spark cluster. I have a parallelization job 
> where I want to parse through several folders, each of which contains multiple 
> files. I parse each file, do some processing on its records, and write the 
> whole file back out to an output file. I do the same processing operation 
> (hashing certain fields in the data file) for all the files inside the folder. 
> Simply, 
> 
> For a directory D, 
>   Read all files inside D. 
>   For each file F in D: 
>     For each line L in F, do some processing and write the 
>       processing output to a file. 
> 
> So if there are 200 files inside the input directory, I would like to have 200 
> files in my output directory. I learnt that with the saveAsTextFile(name) API, 
> Spark creates a directory with the name we specify and writes the actual 
> output files inside that folder as part-00000, part-00001, etc. (similar to 
> Hadoop, I assumed). 
> My question: is there a way to specify the name of the output directory and 
> redirect all my saveAsTextFile(dirName) outputs into a single folder instead? 
> 
> Let me know if there is a way of achieving this. If not, I would appreciate 
> hearing some workarounds. Thanks!
> 
> 
> Regards,
> 
> Ramkumar Chokkalingam, 
> Master's Student, University of Washington || 206-747-3515
> 
>  
> 
