I don't know what you mean by gateway, but in order to have a rough idea
of the time needed you need 3 values:
* the amount of data you want to put on Hadoop
* Hadoop bandwidth with regards to local storage (read/write)
* bandwidth between where your data are stored and where the Hadoop
  cluster is
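For a rough idea, here is a quick back-of-the-envelope sketch; the
numbers below are made up for illustration, not measurements:

    # Back-of-the-envelope transfer estimate; example numbers are made up.
    data_tb = 1000                # petabyte-scale load, in terabytes
    wan_gbps = 10                 # link between data source and cluster
    hdfs_write_gbps = 40          # aggregate HDFS write bandwidth

    # The slowest of the two links dominates the total time.
    bottleneck_gbps = min(wan_gbps, hdfs_write_gbps)
    seconds = data_tb * 8_000 / bottleneck_gbps   # 1 TB = 8,000 gigabits
    print(f"~{seconds / 86_400:.1f} days at {bottleneck_gbps} Gbit/s")

At 10 Gbit/s end to end, a petabyte is on the order of ten days of
continuous transfer, which is why the physical option below is worth
considering.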
For the latter, for big volumes, physically moving the volumes is a
viable solution. It will depend on your constraints of course: budget,
speed...

Bertrand

On Tue, Oct 30, 2012 at 11:39 AM, sumit ghosh <[email protected]> wrote:
> Hi Bertrand,
>
> By physically moving the data do you mean that the data volume is
> connected to the gateway machine and the data is loaded from the local
> copy using copyFromLocal?
>
> Thanks,
> Sumit
>
>
> ________________________________
> From: Bertrand Dechoux <[email protected]>
> To: [email protected]; sumit ghosh <[email protected]>
> Sent: Tuesday, 30 October 2012 3:46 PM
> Subject: Re: Loading Data to HDFS
>
> It might sound like a deprecated way, but can't you move the data
> physically?
> From what I understand, it is one shot and not "streaming", so it
> could be a good method if you have the access, of course.
>
> Regards
>
> Bertrand
>
> On Tue, Oct 30, 2012 at 11:07 AM, sumit ghosh <[email protected]> wrote:
>
> > Hi,
> >
> > I have data on a remote machine accessible over ssh. I have Hadoop
> > CDH4 installed on RHEL. I am planning to load quite a few petabytes
> > of data onto HDFS.
> >
> > Which will be the fastest method to use, and are there any projects
> > around Hadoop which can be used as well?
> >
> > I cannot install Hadoop-Client on the remote machine.
> >
> > Have a great Day Ahead!
> > Sumit.
> >
> >
> > ---------------
> > Here I am attaching my previous discussion on CDH-user to avoid
> > duplication.
> > ---------------
> > On Wed, Oct 24, 2012 at 9:29 PM, Alejandro Abdelnur <[email protected]> wrote:
> > In addition to Jarcec's suggestions, you could use HttpFS. Then
> > you'd only need to poke a single host:port in your firewall, as all
> > the traffic goes thru it.
> > thx
> > Alejandro
> >
> > On Oct 24, 2012, at 8:28 AM, Jarek Jarcec Cecho <[email protected]> wrote:
> > > Hi Sumit,
> > > there are plenty of ways to achieve that. Please find my feedback
> > > below:
> > >
> > >> Does Sqoop support loading flat files to HDFS?
> > >
> > > No, Sqoop only supports moving data from external database and
> > > warehouse systems. Copying files is not supported at the moment.
> > >
> > >> Can I use distcp?
> > >
> > > No. Distcp can be used only to copy data between HDFS filesystems.
> > >
> > >> How do we use the core-site.xml file on the remote machine to use
> > >> copyFromLocal?
> > >
> > > Yes, you can install the hadoop binaries on your machine (with no
> > > Hadoop services running) and use the hadoop binary to upload data.
> > > The installation procedure is described in the CDH4 installation
> > > guide [1] (follow the "client" installation).
> > >
> > > Another way that I can think of is leveraging WebHDFS [2] or maybe
> > > hdfs-fuse [3]?
> > >
> > > Jarcec
> > >
> > > Links:
> > > 1: https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation
> > > 2: https://ccp.cloudera.com/display/CDH4DOC/Deploying+HDFS+on+a+Cluster#DeployingHDFSonaCluster-EnablingWebHDFS
> > > 3: https://ccp.cloudera.com/display/CDH4DOC/Mountable+HDFS
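As a side note, WebHDFS [2] may suit you since it needs no Hadoop
client on the remote machine. A minimal sketch of the two-step create
it uses; the namenode host, port, user, and paths below are made up,
and this assumes WebHDFS is enabled on the cluster:

    # Minimal WebHDFS upload sketch; host, port, user and paths are
    # hypothetical. Assumes WebHDFS is enabled (see link [2] above).
    import requests

    def webhdfs_put(local_path, hdfs_path,
                    namenode="namenode.example.com", port=50070,
                    user="sumit"):
        url = (f"http://{namenode}:{port}/webhdfs/v1{hdfs_path}"
               f"?op=CREATE&user.name={user}&overwrite=true")
        # Step 1: the namenode replies with a 307 redirect pointing at
        # the datanode that will receive the block data.
        r = requests.put(url, allow_redirects=False)
        datanode_url = r.headers["Location"]
        # Step 2: stream the file body to that datanode.
        with open(local_path, "rb") as f:
            r = requests.put(datanode_url, data=f)
        r.raise_for_status()  # 201 Created on success

    webhdfs_put("/data/part-00000.gz", "/user/sumit/part-00000.gz")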
> > >
> > > On Wed, Oct 24, 2012 at 01:33:29AM -0700, Sumit Ghosh wrote:
> > >> Hi,
> > >>
> > >> I have data on a remote machine accessible over ssh. What is the
> > >> fastest way to load data onto HDFS?
> > >>
> > >> Does Sqoop support loading flat files to HDFS?
> > >> Can I use distcp?
> > >> How do we use the core-site.xml file on the remote machine to use
> > >> copyFromLocal?
> > >>
> > >> Which will be the best to use, and are there any other open
> > >> source projects around Hadoop which can be used as well?
> > >> Have a great Day Ahead!
> > >> Sumit
> >
> --
> Bertrand Dechoux
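And if you can attach the volumes to a gateway machine with the CDH
client installed, the upload itself is just copyFromLocal. A
hypothetical sketch driving it from a script, assuming the hadoop
binary is on the PATH and core-site.xml points at your cluster:

    # Hypothetical sketch: push local files into HDFS via the installed
    # client. Assumes "hadoop" is on PATH and core-site.xml is set up.
    import subprocess

    def copy_from_local(local_path, hdfs_dir):
        # Equivalent to: hadoop fs -copyFromLocal <src> <dst>
        subprocess.run(
            ["hadoop", "fs", "-copyFromLocal", local_path, hdfs_dir],
            check=True,  # raise if the upload exits non-zero
        )

    copy_from_local("/data/part-00000.gz", "/user/sumit/incoming/")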
--
Bertrand Dechoux