I was able to transfer the data to S3 successfully with the earlier mentioned workaround. I was also able to max out our available upload bandwidth; I could get an average of around 10 MB/s from the cluster.
I ran the s3distcp jobs with the default timeout and did not face any issues. Thanks all for the help.

Himanish

On Sat, Mar 30, 2013 at 9:26 PM, David Parks <[email protected]> wrote:
> 4-20 MB/sec are common transfer rates from S3 to 1 local AWS box. This was,
> of course, a cluster, and s3distcp is specifically designed to take advantage
> of the cluster, so it was a 45-minute job to transfer the 1.5 TB to the full
> cluster of, I forget how many servers I had at the time, maybe 15-30
> m1.xlarge. The numbers are rough, I could be mistaken and it was 1 ½ hours to
> do the transfer (but I recall 45 min); in either case the s3distcp job ran
> longer than the task timeout period, which was the real point I was focusing
> on.
>
> I seem to recall needing to re-package their jar as well, but for different
> reasons: they package in some other open source utilities and I had version
> conflicts, so you might want to watch for that.
>
> I've never seen this ProgressableResettableBufferedFileInputStream, so I
> can't offer much more advice on that one.
>
> Good luck! Let us know how it turns out.
>
> Dave
>
> From: Himanish Kushary [mailto:[email protected]]
> Sent: Friday, March 29, 2013 9:57 PM
> To: [email protected]
> Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Yes, you are right, CDH4 is the 2.x line, but I even checked the javadocs for
> the 1.0.4 branch (could not find the 1.0.3 APIs, so I used
> http://hadoop.apache.org/docs/r1.0.4/api/index.html) and did not find the
> "ProgressableResettableBufferedFileInputStream" class. Not sure how it is
> present in the hadoop-core.jar in Amazon EMR.
>
> In the meantime I have come up with a dirty workaround by extracting the
> class from the Amazon jar and packaging it into its own separate jar. I am
> actually able to run s3distcp now on local CDH4 using Amazon's jar and
> transfer from my local Hadoop to Amazon S3.
>
> But the real issue is the throughput. You mentioned that you had transferred
> 1.5 TB in 45 mins, which comes to around 583 MB/s. I am barely getting 4 MB/s
> upload speed!! How did you get 100x the speed compared to me? Could you
> please share any settings/tweaks that you may have done to achieve this. Were
> you on some very specific high-bandwidth network? Was it between HDFS on EC2
> and Amazon S3?
>
> Looking forward to hearing from you.
>
> Thanks
> Himanish
>
> On Fri, Mar 29, 2013 at 10:34 AM, David Parks <[email protected]> wrote:
> CDH4 can be either 1.x or 2.x Hadoop; are you using the 2.x line? I've used
> it primarily with 1.0.3, which is what AWS uses, so I presume that's what
> it's tested on.
>
> Himanish Kushary <[email protected]> wrote:
> Thanks Dave.
>
> I had already tried using the s3distcp jar, but got stuck on the below
> error, which made me think that this is something specific to the Amazon
> Hadoop distribution.
>
> Exception in thread "Thread-28" java.lang.NoClassDefFoundError:
> org/apache/hadoop/fs/s3native/ProgressableResettableBufferedFileInputStream
>
> Also, I noticed that the Amazon EMR hadoop-core.jar has this class but it is
> not present in the CDH4 (my local env) Hadoop jars.
>
> Could you suggest how I could get around this issue.
> One option could be using the Amazon-specific jars, but then probably I
> would need to get all the jars (else it could cause version mismatch errors
> for HDFS - NoSuchMethodError etc.).
>
> Appreciate your help regarding this.
>
> - Himanish
>
> On Fri, Mar 29, 2013 at 1:41 AM, David Parks <[email protected]> wrote:
> None of that complexity, they distribute the jar publicly (not the source,
> but the jar). You can just add this to your libjars:
> s3n://*region*.elasticmapreduce/libs/s3distcp/*latest*/s3distcp.jar
>
> No VPN or anything; if you can access the internet you can get to S3.
>
> Follow their docs here:
> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
>
> It doesn't matter where your Hadoop instance is running.
>
> Here's an example of the code/parameters I used to run it from within
> another Tool. It's a Tool, so it's actually designed to run from the Hadoop
> command line normally.
>
> ToolRunner.run(getConf(), new S3DistCp(), new String[] {
>     "--src", "/frugg/image-cache-stage2/",
>     "--srcPattern", ".*part.*",
>     "--dest", "s3n://fruggmapreduce/results-" + env + "/" + JobUtils.isoDate + "/output/itemtable/",
>     "--s3Endpoint", "s3.amazonaws.com" });
>
> Watch the "srcPattern", make sure you have that leading `.*`, that one threw
> me for a loop once.
>
> Dave
>
> From: Himanish Kushary [mailto:[email protected]]
> Sent: Thursday, March 28, 2013 5:51 PM
> To: [email protected]
> Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Hi Dave,
>
> Thanks for your reply. Our Hadoop instance is inside our corporate LAN.
> Could you please provide some details on how I could use s3distcp from
> Amazon to transfer data from our on-premises Hadoop to Amazon S3? Wouldn't
> some kind of VPN be needed between the Amazon EMR instance and our
> on-premises Hadoop instance? Did you mean use the jar from Amazon on our
> local server?
>
> Thanks
>
> On Thu, Mar 28, 2013 at 3:56 AM, David Parks <[email protected]> wrote:
> Have you tried using s3distcp from Amazon? I used it many times to transfer
> 1.5 TB between S3 and Hadoop instances. The process took 45 min, well over
> the 10-min timeout period you're running into a problem on.
>
> Dave
>
> From: Himanish Kushary [mailto:[email protected]]
> Sent: Thursday, March 28, 2013 10:54 AM
> To: [email protected]
> Subject: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Hello,
>
> I am trying to transfer around 70 GB of files from HDFS to Amazon S3 using
> the distcp utility. There are around 2200 files distributed over 15
> directories. The max individual file size is approx 50 MB.
>
> The distcp mapreduce job keeps on failing with this error:
>
> "Task attempt_201303211242_0260_m_000005_0 failed to report status for
> 600 seconds. Killing!"
>
> and in the task attempt logs I can see a lot of INFO messages like:
>
> "INFO org.apache.commons.httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark"
>
> I am thinking of either transferring individual folders instead of the
> entire 70 GB folder as a workaround, or, as another option, increasing the
> "mapred.task.timeout" parameter to something like 6-7 hours (as the avg rate
> of transfer to S3 seems to be 5 MB/s). Is there any other better option to
> increase the throughput for transferring bulk data from HDFS to S3? Looking
> forward to suggestions.
>
> --
> Thanks & Regards
> Himanish

--
Thanks & Regards
Himanish
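
A rough sketch of the timeout option mentioned in the original question (raising mapred.task.timeout and then running a plain distcp), driven from Java in the same ToolRunner style as the s3distcp example quoted above. It assumes the Hadoop 1.x DistCp tool class; the paths, bucket name, credential values and the 6-hour figure are placeholders rather than values taken from the thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.tools.DistCp;   // Hadoop 1.x distcp tool class (assumption)
import org.apache.hadoop.util.ToolRunner;

public class DistCpWithLongTimeout {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Let slow S3 uploads run for up to 6 hours before the task is killed;
        // the command-line equivalent would be -D mapred.task.timeout=21600000.
        conf.setLong("mapred.task.timeout", 6L * 60 * 60 * 1000);
        // Credentials for the s3n:// filesystem (placeholders).
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

        int rc = ToolRunner.run(conf, new DistCp(conf), new String[] {
                "hdfs:///data/to-upload",        // HDFS source directory (placeholder)
                "s3n://my-bucket/backup/" });    // S3 destination (placeholder)
        System.exit(rc);
    }
}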

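And a similarly hedged sketch of the s3distcp workaround that the top of the thread reports using: invoking Amazon's S3DistCp tool from a local CDH4 cluster, assuming Amazon's s3distcp jar (plus the separately packaged ProgressableResettableBufferedFileInputStream class mentioned above) is on the job classpath. The package name, credentials, bucket and paths below are assumptions for illustration only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
// Package/class name as shipped inside Amazon's s3distcp.jar (assumption; check the jar).
import com.amazon.external.elasticmapreduce.s3distcp.S3DistCp;

public class LocalS3DistCpRunner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Credentials for the s3n:// filesystem (placeholders).
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

        int rc = ToolRunner.run(conf, new S3DistCp(), new String[] {
                "--src", "hdfs:///data/to-upload/",      // local HDFS source (placeholder)
                "--dest", "s3n://my-bucket/backup/",     // S3 destination (placeholder)
                "--s3Endpoint", "s3.amazonaws.com" });
        System.exit(rc);
    }
}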