Thanks, Dave. I had already tried using the s3distcp jar, but got stuck on the error below, which made me think that this is something specific to the Amazon Hadoop distribution:
Exception in thread "Thread-28" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/s3native/ProgressableResettableBufferedFileInputStream

Also, I noticed that the Amazon EMR hadoop-core.jar has this class, but it is not present in the CDH4 (my local environment) Hadoop jars. Could you suggest how I could get around this issue? One option could be using the Amazon-specific jars, but then I would probably need to pull in all of them, since mixing them could otherwise cause version-mismatch errors against HDFS (NoSuchMethodError, etc.).
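For context, here is roughly how I am invoking it on our side; it is the same ToolRunner pattern from your mail below, just with an HDFS path as the source. The paths and bucket name here are placeholders rather than our real ones, so treat this as a sketch of the invocation:

    // Sketch only: placeholder paths/bucket, mirroring the ToolRunner pattern
    // from the mail below (S3DistCp is the Tool class from Amazon's s3distcp.jar).
    // The NoClassDefFoundError above shows up when this runs against CDH4.
    ToolRunner.run(getConf(), new S3DistCp(), new String[] {
            "--src",        "hdfs:///path/to/source/",
            "--srcPattern", ".*part.*",
            "--dest",       "s3n://my-bucket/backup/",
            "--s3Endpoint", "s3.amazonaws.com" });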
Appreciate your help regarding this.

- Himanish

On Fri, Mar 29, 2013 at 1:41 AM, David Parks <[email protected]> wrote:

> None of that complexity, they distribute the jar publicly (not the source,
> but the jar). You can just add this to your libjars:
> s3n://*region*.elasticmapreduce/libs/s3distcp/*latest*/s3distcp.jar
>
> No VPN or anything; if you can access the internet you can get to S3.
>
> Follow their docs here:
> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
>
> It doesn't matter where your Hadoop instance is running.
>
> Here's an example of the code/parameters I used to run it from within
> another Tool; it's a Tool itself, so it's actually designed to be run from
> the Hadoop command line normally.
>
>     ToolRunner.run(getConf(), new S3DistCp(), new String[] {
>             "--src",        "/frugg/image-cache-stage2/",
>             "--srcPattern", ".*part.*",
>             "--dest",       "s3n://fruggmapreduce/results-" + env + "/" + JobUtils.isoDate + "/output/itemtable/",
>             "--s3Endpoint", "s3.amazonaws.com" });
>
> Watch the "srcPattern": make sure you have that leading `.*`, that one
> threw me for a loop once.
>
> Dave
>
> From: Himanish Kushary [mailto:[email protected]]
> Sent: Thursday, March 28, 2013 5:51 PM
> To: [email protected]
> Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Hi Dave,
>
> Thanks for your reply. Our Hadoop instance is inside our corporate LAN.
> Could you please provide some details on how I could use s3distcp from
> Amazon to transfer data from our on-premises Hadoop to Amazon S3?
> Wouldn't some kind of VPN be needed between the Amazon EMR instance and
> our on-premises Hadoop instance? Did you mean use the jar from Amazon on
> our local server?
>
> Thanks
>
> On Thu, Mar 28, 2013 at 3:56 AM, David Parks <[email protected]> wrote:
>
> Have you tried using s3distcp from Amazon? I used it many times to
> transfer 1.5 TB between S3 and Hadoop instances. The process took 45 min,
> well over the 10-minute timeout period you're running into a problem with.
>
> Dave
>
> From: Himanish Kushary [mailto:[email protected]]
> Sent: Thursday, March 28, 2013 10:54 AM
> To: [email protected]
> Subject: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Hello,
>
> I am trying to transfer around 70 GB of files from HDFS to Amazon S3 using
> the distcp utility. There are around 2,200 files distributed over 15
> directories. The maximum individual file size is approximately 50 MB.
>
> The distcp mapreduce job keeps failing with this error:
>
> "Task attempt_201303211242_0260_m_000005_0 failed to report status for
> 600 seconds. Killing!"
>
> and in the task attempt logs I can see a lot of INFO messages like:
>
> "INFO org.apache.commons.httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark"
>
> As a workaround I am thinking of either transferring individual folders
> instead of the entire 70 GB, or increasing the "mapred.task.timeout"
> parameter to something like 6-7 hours (as the average rate of transfer to
> S3 seems to be 5 MB/s). Is there any other, better option to increase the
> throughput for transferring bulk data from HDFS to S3? Looking forward to
> suggestions.
>
> --
> Thanks & Regards
> Himanish
>
> --
> Thanks & Regards
> Himanish

--
Thanks & Regards
Himanish
