As a general rule, data in HDFS is more useful than data in NFS, and data
in NFS is more useful than data in local files; so I'd recommend that you
investigate early how to get your data into the distributed filesystem, so
that you can work with it in parallel using Spark or other tools that work
with HDFS.  Using Spark itself to push data into HDFS is possible, but not
optimal: it will quickly become a bottleneck for large datasets.  Moving
data across the network is expensive, so it is worth spending the design
time, and even writing custom scripts or code, to minimize such transfers
rather than trying to do everything from within Spark.


On Fri, Oct 11, 2013 at 10:59 AM, Ramkumar Chokkalingam <
[email protected]> wrote:

> Thanks Mark, for the response.
>
> I have my input on the server as local files. We haven't decided whether
> we might set up an NFS server. We have configured the server machine -
> installed Hadoop and have HDFS set up. To achieve my goal, what is the
> change that you would recommend over the pipeline I suggested?
>
>
>
