Ramkumar, it sounds like you could consider a file-parallel approach rather than a strict data-parallel parsing of the problem. In other words, separate the file-copying task from the file-parsing task. Have the driver program D handle the directory scan, then parallelize the resulting file list across N slaves S[1 .. N]. The file contents themselves can be (a) passed from driver D to the slaves S as a serialized data structure, (b) copied by D into HDFS, or (c) made available via another distributed filesystem such as NFS. When a slave finishes processing, it writes its result back out to HDFS, where D picks it up and copies it into your desired output directory structure.
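For concreteness, here's a rough sketch of what that driver/slave split could look like in Spark (Scala). It assumes option (c) above, i.e. the input tree is on a filesystem visible to all workers; the master URL, the /nfs paths, and the encryptFields helper are placeholders for your own setup, not a definitive implementation:

import org.apache.spark.SparkContext
import java.io.File
import java.nio.file.{Files, Paths}

object FileParallelParse {
  // Hypothetical stand-in for the real field-encryption step.
  def encryptFields(line: String): String = line

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("spark://master:7077", "file-parallel-parse")
    val inRoot  = "/nfs/input"   // assumed shared input directory (option (c))
    val outRoot = "/nfs/output"  // mirrored output directory

    // Driver D scans the input tree and parallelizes the *file list*,
    // not the file contents.
    def listFiles(dir: File): Seq[File] =
      dir.listFiles.toSeq.flatMap { f =>
        if (f.isDirectory) listFiles(f) else Seq(f)
      }
    val relPaths = listFiles(new File(inRoot)).map { f =>
      Paths.get(inRoot).relativize(f.toPath).toString
    }

    // Each slave S[i] parses its files independently and writes each
    // transformed copy under the mirrored output path.
    sc.parallelize(relPaths, math.max(1, relPaths.size)).foreach { rel =>
      val src = Paths.get(inRoot, rel)
      val dst = Paths.get(outRoot, rel)
      Files.createDirectories(dst.getParent)
      val lines = scala.io.Source.fromFile(src.toFile).getLines().map(encryptFields)
      Files.write(dst, lines.mkString("\n").getBytes("UTF-8"))
    }
    sc.stop()
  }
}

Because each input file maps one-to-one to an output file here, the union-collapsing problem from your original question doesn't arise.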
This is admittedly a bit of file copying back and forth over the network, but if your input lives on some filesystem and your output must land on the same kind of structure, you'd incur that cost at some point anyway. And if the file parsing is much more expensive than the file transfer, then you do get significant speed gains from parallelizing the parsing task. It's also quite conducive to getting to code-complete in an hour or less. KISS.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen

On Thu, Oct 10, 2013 at 4:30 PM, Ramkumar Chokkalingam <[email protected]> wrote:
> Hey,
>
> Thanks for the mail, Matei. Since I need the output directory structure
> to be the same as the input directory structure, with some changes made to
> the content of those files while parsing [replacing certain fields with
> their encrypted values], I wouldn't want the union to combine some of the
> input files into a single file.
>
> Is there some API which would treat each file as independent and write to
> an output file? That would've been great.
>
> If it doesn't work, then I'll have to write each of them to a folder and
> process them (using some script) to match my input directory structure.
