Yeah, Christopher answered this before I could, but you can list the directory 
on the driver node to find all the filenames, and then use 
SparkContext.parallelize() on the array of filenames to split the set of 
filenames among tasks. After that, run foreach() on the parallelized RDD and 
have each task read its input file and write the corresponding output file 
using the HDFS API.
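
Something like this, for example (just an untested sketch from the spark-shell; 
inputRoot, outputRoot and encryptFields are placeholders for your own paths and 
parsing logic, and the listing here assumes a single flat directory; you'd walk 
it recursively to preserve a nested structure):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Placeholder paths and transform -- substitute your own
val inputRoot = "hdfs:///user/you/input"
val outputRoot = "hdfs:///user/you/output"
def encryptFields(line: String): String = line  // your parsing/encryption goes here

// List the input directory on the driver using the HDFS API
val fs = FileSystem.get(new java.net.URI(inputRoot), new Configuration())
val fileNames = fs.listStatus(new Path(inputRoot)).map(_.getPath.toString)

// Split the filenames among tasks
val filesRDD = sc.parallelize(fileNames)

// Each task reads its file and writes the corresponding output file via the HDFS API
filesRDD.foreach { name =>
  // FileSystem isn't serializable, so get one inside the task
  val taskFs = FileSystem.get(new java.net.URI(name), new Configuration())
  val in = taskFs.open(new Path(name))
  val outPath = new Path(name.replace(inputRoot, outputRoot))
  val out = taskFs.create(outPath, true)
  Source.fromInputStream(in).getLines().map(encryptFields)
    .foreach(l => out.write((l + "\n").getBytes("UTF-8")))
  in.close()
  out.close()
}

Since each task writes one output file per input file, the output mirrors the 
input layout and nothing gets unioned into a single file.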

Matei

On Oct 10, 2013, at 5:50 PM, Christopher Nguyen <c...@adatao.com> wrote:

> Ramkumar, it sounds like you can consider a file-parallel approach rather 
> than a strictly data-parallel framing of the problem. In other words, separate 
> the file copying task from the file parsing task. Have the driver program D 
> handle the directory scan and then parallelize the file list across N 
> slaves S[1 .. N]. The file contents themselves can either be (a) passed from 
> driver D to slaves S as a serialized data structure, (b) copied by 
> driver D into HDFS, or (c) copied via another distributed filesystem such as 
> NFS. When a slave's processing is complete, it writes its result back out to 
> HDFS, where it is then picked up by D and copied to your desired output 
> directory structure.
> 
> This is admittedly a bit of file copying back and forth over the network, but 
> if your input lives on some file system and your output goes to the 
> same place, then you'd incur that cost at some point anyway. And if the file 
> parsing is much more expensive than the file transfer, then you do get 
> significant speed gains from parallelizing the parsing task.
> 
> It's also quite conducive to getting to code complete in an hour or less. KISS.
> 
> --
> Christopher T. Nguyen
> Co-founder & CEO, Adatao
> linkedin.com/in/ctnguyen
> 
> 
> 
> On Thu, Oct 10, 2013 at 4:30 PM, Ramkumar Chokkalingam 
> <ramkumar...@gmail.com> wrote:
> Hey, 
> 
> Thanks for the mail, Matei. Since I need the output directory 
> structure to be the same as the input directory structure, with some changes made 
> to the content of those files while parsing [replacing certain fields with 
> their encrypted values], I wouldn't want the union to combine a few of the input 
> files into a single file. 
> 
> Is there some API that would treat each file as independent and write to a 
> separate output file? That would be great. 
> 
> If that doesn't exist, then I'll have to write them all to one folder and process 
> each of them (using some script) to match my input directory structure. 
> 
