If the data is coming off of a single machine, then simply running multiple
threads on that machine spraying the data into the cluster is likely to be
faster than a map-reduce program.  The reason is that you can run the
spraying process continuously and can tune it to carefully saturate your
outbound link toward the cluster.  With a map-reduce program it is very
easy to flood the link instead.
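A minimal sketch of the "spraying" approach, assuming Python on the source
machine and the standard `hadoop fs -copyFromLocal` shell command at the
destination; the `spray` and `upload_file` names and the `/ingest` target
directory are my own illustrations, not anything from the thread. The
upload function is pluggable so you can tune the thread count against your
outbound link:

```python
# Hypothetical sketch: one long-running process on the source machine
# pushes files into the cluster from a bounded pool of worker threads.
from concurrent.futures import ThreadPoolExecutor
import subprocess

def upload_file(path, dest_dir="/ingest"):
    """Push one local file into HDFS via the hadoop CLI; returns its exit code."""
    return subprocess.call(
        ["hadoop", "fs", "-copyFromLocal", path, dest_dir])

def spray(paths, upload=upload_file, threads=4):
    """Upload all paths using a fixed pool of worker threads.

    Raising or lowering `threads` is how you tune toward saturating
    the outbound link without flooding it.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(upload, paths))
```

Because `upload` is a parameter, the same loop works with `hadoop fs -put`,
`distcp` staging, or any other copy command.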

Another issue is that it is easy to push data to the cluster from a local
disk rather than to pull it from nodes in the cluster because most network
file protocols aren't as efficient as you might like.

On Tue, Dec 28, 2010 at 2:47 PM, Taylor, Ronald C <[email protected]> wrote:

> 2) some way of parallelizing the reads
>
> So - I will check into network hardware, in regard to (1). But for (2), is
> the MapReduce method that I was thinking of, which uses "hadoop fs
> -copyFromLocal" in each Mapper, a good way to go at the destination end? I
> believe that you were saying that it is indeed OK, but I want to
> double-check, since this will be a critical piece of our workflow.
>
