Ramkumar, it sounds like you could consider a file-parallel approach rather
than a strict data-parallel parsing of the problem. In other words,
separate the file-copying task from the file-parsing task. Have the driver
program D handle the directory scan, then parallelize the resulting file
list across N slaves S[1 .. N]. The file contents themselves can be (a)
passed from driver D to the slaves S as serialized data structures, (b)
copied by the driver D into HDFS, or (c) copied via another distributed
filesystem such as NFS. When a slave finishes processing, it writes its
result back out to HDFS, where D picks it up and copies it into your
desired output directory structure.
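A minimal single-machine sketch of the pattern, using Python's multiprocessing pool in place of actual Spark slaves; the upper-casing transform is just a stand-in for your real parsing/encryption step, and all names here are illustrative:

```python
import os
from multiprocessing import Pool

def parse_file(args):
    """Slave S[i]: parse one file, mirroring its relative path under out_root."""
    in_path, in_root, out_root = args
    rel = os.path.relpath(in_path, in_root)
    out_path = os.path.join(out_root, rel)
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    with open(in_path) as src, open(out_path, "w") as dst:
        # Placeholder transform; your version would replace certain
        # fields with their encrypted values instead.
        dst.write(src.read().upper())
    return out_path

def run(in_root, out_root, workers=4):
    # Driver D: scan the directory tree, then farm the file list
    # out to N worker processes.
    files = [os.path.join(d, name)
             for d, _, names in os.walk(in_root) for name in names]
    with Pool(workers) as pool:
        return pool.map(parse_file,
                        [(p, in_root, out_root) for p in files])
```

Because each input file maps to exactly one output file, the input directory structure is preserved without any post-processing script.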

This admittedly involves a fair amount of file copying back and forth over
the network, but if both your input and output live on some file system,
you'd incur that cost at some point anyway. And if parsing a file is much
more expensive than transferring it, you still get significant speed gains
from parallelizing the parsing task.

It's also quite conducive to getting to code complete in an hour or less.
KISS.

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen



On Thu, Oct 10, 2013 at 4:30 PM, Ramkumar Chokkalingam <
[email protected]> wrote:

> Hey,
>
> Thanks for the mail, Matei. I need the output directory structure to be
> the same as the input directory structure, with some changes made to the
> content of those files while parsing [replacing certain fields with their
> encrypted values]. I wouldn't want the union to combine several of the
> input files into a single file.
>
> Is there some API which would treat each file as independent and write to
> an output file? That would be great.
>
> If that doesn't work, then I'll have to write them each to a folder and
> process each of them (using some script) to match my input directory
> structure.
>
