Hi Dave,

Thanks for the suggestions. Glad to hear from a fellow DOE national lab person! 

We are just starting to explore all this here at Pacific Northwest Nat Lab, and 
what will be going into HBase and what will be left as files in HDFS is an open 
question, to be empirically determined over the coming year. It will depend on 
what instrument data gets put in, how the users want to analyze the data, 
what turns out to be practical for future growth and maintenance, etc. My lab 
colleagues Kevin Fox and David Brown have a lot more experience handling 
massive amounts of data - they are already handling hundreds of TBs in the 
archive cluster for EMSL, our national user facility (lots of mass spec, NMR, 
microscopy, and next-gen sequencing machines for biology and chemistry, as you 
may already know). And they have a much better grip on the hardware and OS side 
of things. So I imagine you & the list will be hearing directly from them 
fairly often as questions arise.

 Ron

-----Original Message-----
From: Buttler, David [mailto:[email protected]] 
Sent: Monday, January 03, 2011 12:21 PM
To: [email protected]; '[email protected]'
Cc: Fox, Kevin M; Brown, David M JR
Subject: RE: What is the fastest way to get a large amount of data into the 
Hadoop HDFS file system (or Hbase)?

Hi Ron,
Loading into HDFS and HBase are two different issues.  

HDFS: if you have a large number of files to load from your NFS file system 
into HDFS, it is not clear that parallelizing the load will help.  You have two 
potential bottlenecks: the NFS file system and the HDFS file system.  In your 
parallel example, you will likely saturate your NFS file system first.  If the 
files are actually local to individual machines, then loading them via M/R is a 
non-starter, as you have no control over which machine will get a map task - 
unless all of the machines have files in the same local directory and each map 
task just looks in that directory to upload.  Even then, it sounds like more of 
a job for a parallel shell command than for a map/reduce job.
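
In the NFS case, something as simple as the following (untested sketch; the 
list-file argument and the /data/incoming target directory are just 
placeholders), run on one machine that mounts the NFS share, is probably about 
as fast as anything more elaborate, since the NFS mount is the likely limiter:

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Untested sketch: read a list of local (NFS-mounted) file paths, one per
// line, and copy each into HDFS from a single client process.
public class SimpleHdfsLoad {
    public static void main(String[] args) throws Exception {
        FileSystem hdfs = FileSystem.get(new Configuration());
        Path destDir = new Path("/data/incoming");   // placeholder HDFS directory
        BufferedReader list = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = list.readLine()) != null) {
            String localPath = line.trim();
            if (localPath.isEmpty()) continue;
            // Roughly what "hadoop fs -copyFromLocal <source> <dest>" does
            hdfs.copyFromLocalFile(new Path(localPath), destDir);
            System.out.println("copied " + localPath);
        }
        list.close();
    }
}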

HBase: So far my strategy has been to get the files into HDFS first, and then 
write a Map job to load them into HBase.  You can try this and see if direct 
inserts into HBase are fast enough for your use case.  But if you are going to 
be loading TBs/week, then you will likely want to investigate the bulk load 
features.  I haven't yet incorporated those into my workflow, so I can't offer 
much advice there. Just be sure your cluster is sized appropriately.  E.g., 
with compression turned on in HBase, see how much a 1 GB input file expands to 
inside HBase/HDFS.  That should give you a feeling for how much space you will 
need for your expected data load.
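
To make the "Map job into HBase" idea a bit more concrete, here is a rough, 
untested sketch of a map-only job that writes Puts through TableOutputFormat. 
The table name ("my_table"), column family ("d"), and the tab-separated 
"rowkey<TAB>value" record format are all placeholders, and the exact API names 
shift a bit across HBase releases, so treat it as an outline:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class HBaseLoad {

    public static class LoadMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        private static final byte[] FAMILY = Bytes.toBytes("d");  // placeholder family

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assume "rowkey<TAB>value" lines; adjust parsing to the real format.
            String[] fields = line.toString().split("\t", 2);
            if (fields.length < 2) return;
            Put put = new Put(Bytes.toBytes(fields[0]));
            put.add(FAMILY, Bytes.toBytes("value"), Bytes.toBytes(fields[1]));
            context.write(new ImmutableBytesWritable(put.getRow()), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "load into HBase");
        job.setJarByClass(HBaseLoad.class);
        job.setMapperClass(LoadMapper.class);
        job.setNumReduceTasks(0);            // map-only: Puts go straight to the table
        job.setOutputFormatClass(TableOutputFormat.class);
        job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, "my_table");
        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));  // files already in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}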

Dave


-----Original Message-----
From: Taylor, Ronald C [mailto:[email protected]] 
Sent: Tuesday, December 28, 2010 2:05 PM
To: '[email protected]'; '[email protected]'
Cc: Taylor, Ronald C; Fox, Kevin M; Brown, David M JR
Subject: What is the fastest way to get a large amount of data into the Hadoop 
HDFS file system (or Hbase)?


Folks,

We plan on uploading large amounts of data on a regular basis onto a Hadoop 
cluster, with HBase operating on top of Hadoop - figure eventually on the order 
of multiple terabytes per week. So we are concerned about doing the uploads 
themselves as fast as possible from our native Linux file system into HDFS. 
Figure the files will be in, roughly, the 1 to 300 GB range. 

Off the top of my head, I'm thinking that doing this in parallel using a Java 
MapReduce program would work fastest. So my idea would be to have a file 
listing all the data files (full paths) to be uploaded, one per line, and then 
use that listing file as input to a MapReduce program. 

Each Mapper would then upload one of the data files (using "hadoop fs 
-copyFromLocal <source> <dest>") in parallel with all the other Mappers, with 
the Mappers operating on all the nodes of the cluster, spreading out the file 
upload across the nodes.
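
To make that concrete, here is a rough, untested sketch of the sort of Mapper I 
have in mind; the /data/incoming HDFS target is just a placeholder, and the 
driver would feed it the listing file (ideally with something like 
NLineInputFormat so each line becomes its own map task). It uses the FileSystem 
API equivalent of "hadoop fs -copyFromLocal":

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Untested sketch: each input line is the full local path of one data file.
// Assumes that path is visible on whatever node ends up running the map task
// (e.g., through a shared mount).
public class UploadMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    private FileSystem hdfs;
    private final Path destDir = new Path("/data/incoming");  // placeholder target

    @Override
    protected void setup(Context context) throws IOException {
        hdfs = FileSystem.get(context.getConfiguration());
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String localPath = line.toString().trim();
        if (localPath.isEmpty()) return;
        // Same operation as "hadoop fs -copyFromLocal <source> <dest>"
        hdfs.copyFromLocalFile(new Path(localPath), destDir);
        context.write(line, NullWritable.get());  // record which file this task copied
    }
}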

Does that sound like a wise way to approach this? Are there better methods? 
Anything else out there for doing automated upload in parallel? We would very 
much appreciate advice in this area, since we believe upload speed might become 
a bottleneck.

  - Ron Taylor

___________________________________________
Ronald Taylor, Ph.D.
Computational Biology & Bioinformatics Group

Pacific Northwest National Laboratory
902 Battelle Boulevard
P.O. Box 999, Mail Stop J4-33
Richland, WA  99352 USA
Office:  509-372-6568
Email: [email protected]

