Patrick,

Thanks for the info (and quick reply). I want to make sure I understand: 
Presuming that the data files are coming off a set of disk drives attached to a 
single Linux file server, you say I need two things to optimize the transfer:

1)  a fat network pipe

2) some way of parallelizing the reads

So - I will check into network hardware for (1). But for (2), is the 
MapReduce method that I was thinking of, one that runs "hadoop fs 
-copyFromLocal" in each Mapper, a good way to go at the destination end? I 
believe that you were saying that it is indeed OK, but I want to double-check, 
since this will be a critical piece of our workflow.

 Ron


________________________________
From: [email protected] On Behalf Of 
Patrick Angeles
Sent: Tuesday, December 28, 2010 2:27 PM
To: [email protected]
Cc: [email protected]; Taylor, Ronald C; Fox, Kevin M; Brown, David M JR
Subject: Re: What is the fastest way to get a large amount of data into the 
Hadoop HDFS file system (or Hbase)?

Ron,

While MapReduce can help to parallelize the load effort, your likely bottleneck 
is the source system (where the files come from). If the files are coming from 
a single server, then parallelizing the load won't gain you much past a certain 
point. You have to figure in how fast you can read the file(s) off disk(s) and 
push the bits through your network and finally onto HDFS.

The best scenario is if you can parallelize the reads and have a fat network 
pipe (10GbE or more) going into your Hadoop cluster.
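
To make the bottleneck argument concrete, here is a rough back-of-the-envelope sketch. It assumes the sustained end-to-end rate is capped by the slowest of three stages (disk read, network, HDFS write). Every rate in it is a hypothetical placeholder rather than a measurement, and the function names are made up for illustration.

```python
# Back-of-the-envelope transfer estimate. All rates below are hypothetical
# placeholders, not measurements from any real system.

def bottleneck_rate(stage_rates_mb_s):
    """Sustained end-to-end rate is capped by the slowest stage (MB/s)."""
    return min(stage_rates_mb_s.values())

def transfer_hours(total_gb, stage_rates_mb_s):
    """Hours needed to move total_gb at the bottleneck rate (1 GB = 1000 MB)."""
    rate = bottleneck_rate(stage_rates_mb_s)
    return (total_gb * 1000.0) / rate / 3600.0

# Assumed example: one file server with a 1GbE link.
single_server = {
    "disk_read": 300.0,   # assumed sequential read rate of the source disks
    "network": 125.0,     # 1 Gbit/s is 125 MB/s before protocol overhead
    "hdfs_write": 400.0,  # assumed aggregate cluster ingest rate
}

# Same server with a 10GbE link: the network stage is no longer the minimum.
fat_pipe = dict(single_server, network=1250.0)

if __name__ == "__main__":
    for name, rates in (("1GbE", single_server), ("10GbE", fat_pipe)):
        print(name, bottleneck_rate(rates), "MB/s,",
              round(transfer_hours(1000, rates), 2), "hours per TB")
```

Under these assumed figures, widening the network pipe alone shifts the limit to the source disks, which is why the reads also have to be parallelized.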

Regards,

- Patrick

On Tue, Dec 28, 2010 at 5:04 PM, Taylor, Ronald C 
<[email protected]> wrote:

Folks,

We plan on uploading large amounts of data on a regular basis onto a Hadoop 
cluster, with Hbase operating on top of Hadoop. Figure eventually on the order 
of multiple terabytes per week. So - we are concerned about doing the uploads 
themselves as fast as possible from our native Linux file system into HDFS. 
Figure files will be in, roughly, the 1 to 300 GB range.

Off the top of my head, I'm thinking that doing this in parallel using a Java 
MapReduce program would work fastest. So my idea would be to have a file 
listing all the data files (full paths) to be uploaded, one per line, and then 
use that listing file as input to a MapReduce program.

Each Mapper would then upload one of the data files (using "hadoop fs 
-copyFromLocal <source> <dest>") in parallel with all the other Mappers, with 
the Mappers operating on all the nodes of the cluster, spreading out the file 
upload across the nodes.
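
The per-file work described above can be sketched as follows. Note that this is not a MapReduce job: it is a simpler stand-in that uses a thread pool on one client to run one "hadoop fs -copyFromLocal" per line of the listing file, so it does not spread the load across cluster nodes the way Mappers would. The function names, the worker count, and the injectable `runner` parameter are assumptions made for illustration; it also assumes the `hadoop` command is on the PATH and that every listed path is readable from the machine running the script.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def read_listing(listing_path):
    """Return the non-empty lines of the listing file (one full path per line)."""
    with open(listing_path) as fh:
        return [line.strip() for line in fh if line.strip()]

def build_command(source, dest_dir):
    """Command line for copying one local file into an HDFS directory."""
    return ["hadoop", "fs", "-copyFromLocal", source, dest_dir]

def run_command(cmd):
    """Run one copy and return its exit status (0 means success)."""
    return subprocess.run(cmd).returncode

def upload_all(listing_path, dest_dir, workers=4, runner=run_command):
    """Copy every listed file in parallel; return {source: exit status}."""
    sources = read_listing(listing_path)
    commands = [build_command(src, dest_dir) for src in sources]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so statuses line up with sources.
        statuses = list(pool.map(runner, commands))
    return dict(zip(sources, statuses))
```

A call might look like `upload_all("files_to_upload.txt", "/data/incoming", workers=8)`, where both the listing file name and the HDFS directory are hypothetical. Any non-zero status in the returned dictionary marks a file worth retrying.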

Does that sound like a wise way to approach this? Are there better methods? 
Anything else out there for doing automated upload in parallel? We would very 
much appreciate advice in this area, since we believe upload speed might become 
a bottleneck.

 - Ron Taylor

___________________________________________
Ronald Taylor, Ph.D.
Computational Biology & Bioinformatics Group

Pacific Northwest National Laboratory
902 Battelle Boulevard
P.O. Box 999, Mail Stop J4-33
Richland, WA  99352 USA
Office:  509-372-6568
Email: [email protected]


