If the data is coming off a single machine, then simply running multiple threads on that machine to spray the data into the cluster is likely to be faster than a map-reduce program. The reason is that you can run the spraying process continuously and tune it to carefully saturate your outbound link toward the cluster. With a map-reduce program it is very easy to flatten that link instead.
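A minimal sketch of that spraying approach, assuming the copies are done by shelling out to `hadoop fs -copyFromLocal` (the paths, thread count, and the injectable `runner` parameter are illustrative, not from the original mail):

```python
# Sketch: a fixed pool of uploader threads pushes local files into the
# cluster concurrently. Each thread runs one long copy at a time, which
# keeps the outbound link busy without map-reduce scheduling overhead.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def spray(files, hdfs_dir, threads=8, runner=subprocess.run):
    """Copy each local file into hdfs_dir, `threads` uploads at a time.

    `runner` is injectable so the sketch can be exercised without a
    live cluster; by default it invokes the real hadoop CLI.
    """
    def upload(path):
        # Hypothetical invocation of the hadoop fs shell for one file.
        runner(["hadoop", "fs", "-copyFromLocal", path, hdfs_dir],
               check=True)
        return path

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(upload, files))
```

In practice you would tune `threads` upward until the machine's outbound NIC is saturated, then stop; more threads past that point only add contention.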
Another issue is that it is easier to push data to the cluster from a local disk than to pull it from nodes in the cluster, because most network file protocols aren't as efficient as you might like.

On Tue, Dec 28, 2010 at 2:47 PM, Taylor, Ronald C <[email protected]> wrote:

> 2) some way of parallelizing the reads
>
> So - I will check into network hardware, in regard to (1). But for (2), is
> the MapReduce method that I was thinking of, a way that uses "hadoop fs
> -copyFromLocal" in each Mapper, a good way to go at the destination end? I
> believe that you were saying that it is indeed OK, but I want to
> double-check, since this will be a critical piece of our work flow.
