I was able to transfer the data to S3 successfully with the earlier mentioned workaround. I was also able to max out our available upload bandwidth; I could get an average of around 10 MB/s from the cluster.
I ran the s3distcp jobs with the default timeout and did not face any issues. Thanks all for the help.

Himanish

On Sat, Mar 30, 2013 at 9:26 PM, David Parks <[email protected]> wrote:
> 4-20 MB/sec are common transfer rates from S3 to 1 local AWS box. This was,
> of course, a cluster, and s3distcp is specifically designed to take advantage
> of the cluster, so it was a 45-minute job to transfer the 1.5 TB to the full
> cluster of, I forget how many servers I had at the time, maybe 15-30
> m1.xlarge. The numbers are rough, I could be mistaken and it was 1 ½ hours to
> do the transfer (but I recall 45 min); in either case the s3distcp job ran
> longer than the task timeout period, which was the real point I was focusing
> on.
>
> I seem to recall needing to re-package their jar as well, but for different
> reasons: they package in some other open source utilities and I had version
> conflicts, so you might want to watch for that.
>
> I've never seen this ProgressableResettableBufferedFileInputStream, so I
> can't offer much more advice on that one.
>
> Good luck! Let us know how it turns out.
>
> Dave
>
> From: Himanish Kushary [mailto:[email protected]]
> Sent: Friday, March 29, 2013 9:57 PM
> To: [email protected]
> Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Yes, you are right, CDH4 is the 2.x line, but I even checked the javadocs for
> the 1.0.4 branch (could not find the 1.0.3 APIs, so I used
> http://hadoop.apache.org/docs/r1.0.4/api/index.html) and did not find the
> "ProgressableResettableBufferedFileInputStream" class. Not sure how it is
> present in the hadoop-core.jar in Amazon EMR.
>
> In the meantime I have come up with a dirty workaround by extracting the
> class from the Amazon jar and packaging it into its own separate jar. I am
> actually able to run s3distcp now on local CDH4 using Amazon's jar and
> transfer from my local Hadoop to Amazon S3.
>
> But the real issue is the throughput. You mentioned that you had transferred
> 1.5 TB in 45 mins, which comes to around 583 MB/s. I am barely getting 4 MB/s
> upload speed!! How did you get 100x the speed compared to me? Could you
> please share any settings/tweaks that you may have done to achieve this. Were
> you on some very specific high-bandwidth network? Was it between HDFS on EC2
> and Amazon S3?
>
> Looking forward to hearing from you.
>
> Thanks
> Himanish
>
> On Fri, Mar 29, 2013 at 10:34 AM, David Parks <[email protected]> wrote:
> CDH4 can be either 1.x or 2.x Hadoop; are you using the 2.x line? I've used
> it primarily with 1.0.3, which is what AWS uses, so I presume that's what
> it's tested on.
>
> Himanish Kushary <[email protected]> wrote:
> Thanks Dave.
>
> I had already tried using the s3distcp jar, but got stuck on the below
> error, which made me think that this is something specific to the Amazon
> Hadoop distribution.
>
> Exception in thread "Thread-28" java.lang.NoClassDefFoundError:
> org/apache/hadoop/fs/s3native/ProgressableResettableBufferedFileInputStream
>
> Also, I noticed that the Amazon EMR hadoop-core.jar has this class but it is
> not present in the CDH4 (my local env) Hadoop jars.
>
> Could you suggest how I could get around this issue.
> One option could be using the Amazon-specific jars, but then probably I
> would need to get all the jars (else it could cause version mismatch errors
> for HDFS - NoSuchMethodError etc.).
>
> Appreciate your help regarding this.
>
> - Himanish
>
> On Fri, Mar 29, 2013 at 1:41 AM, David Parks <[email protected]> wrote:
> None of that complexity, they distribute the jar publicly (not the source,
> but the jar). You can just add this to your libjars:
> s3n://*region*.elasticmapreduce/libs/s3distcp/*latest*/s3distcp.jar
>
> No VPN or anything; if you can access the internet you can get to S3.
>
> Follow their docs here:
> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
>
> It doesn't matter where your Hadoop instance is running.
>
> Here's an example of the code/parameters I used to run it from within
> another Tool. It's a Tool, so it's actually designed to run from the Hadoop
> command line normally.
>
> ToolRunner.run(getConf(), new S3DistCp(), new String[] {
>     "--src", "/frugg/image-cache-stage2/",
>     "--srcPattern", ".*part.*",
>     "--dest", "s3n://fruggmapreduce/results-" + env + "/" + JobUtils.isoDate + "/output/itemtable/",
>     "--s3Endpoint", "s3.amazonaws.com" });
>
> Watch the "srcPattern", make sure you have that leading `.*`, that one threw
> me for a loop once.
>
> Dave
>
> From: Himanish Kushary [mailto:[email protected]]
> Sent: Thursday, March 28, 2013 5:51 PM
> To: [email protected]
> Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Hi Dave,
>
> Thanks for your reply. Our Hadoop instance is inside our corporate LAN.
> Could you please provide some details on how I could use s3distcp from
> Amazon to transfer data from our on-premises Hadoop to Amazon S3? Wouldn't
> some kind of VPN be needed between the Amazon EMR instance and our
> on-premises Hadoop instance? Did you mean use the jar from Amazon on our
> local server?
>
> Thanks
>
> On Thu, Mar 28, 2013 at 3:56 AM, David Parks <[email protected]> wrote:
> Have you tried using s3distcp from Amazon? I used it many times to transfer
> 1.5 TB between S3 and Hadoop instances. The process took 45 min, well over
> the 10-min timeout period you're running into a problem on.
>
> Dave
>
> From: Himanish Kushary [mailto:[email protected]]
> Sent: Thursday, March 28, 2013 10:54 AM
> To: [email protected]
> Subject: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Hello,
>
> I am trying to transfer around 70 GB of files from HDFS to Amazon S3 using
> the distcp utility. There are around 2200 files distributed over 15
> directories. The max individual file size is approx 50 MB.
>
> The distcp mapreduce job keeps on failing with this error:
>
> "Task attempt_201303211242_0260_m_000005_0 failed to report status for
> 600 seconds. Killing!"
>
> and in the task attempt logs I can see a lot of INFO messages like:
>
> "INFO org.apache.commons.httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark"
>
> I am thinking of either transferring individual folders instead of the
> entire 70 GB folder as a workaround, or, as another option, increasing the
> "mapred.task.timeout" parameter to something like 6-7 hours (as the avg rate
> of transfer to S3 seems to be 5 MB/s). Is there any other better option to
> increase the throughput for transferring bulk data from HDFS to S3? Looking
> forward to suggestions.
>
> --
> Thanks & Regards
> Himanish

--
Thanks & Regards
Himanish
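
A rough sketch of the timeout option mentioned in the original question (raising mapred.task.timeout and then running a plain distcp), driven from Java in the same ToolRunner style as the s3distcp example quoted above. It assumes the Hadoop 1.x DistCp tool class; the paths, bucket name, credential values and the 6-hour figure are placeholders rather than values taken from the thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.tools.DistCp;   // Hadoop 1.x distcp tool class (assumption)
import org.apache.hadoop.util.ToolRunner;

public class DistCpWithLongTimeout {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Let slow S3 uploads run for up to 6 hours before the task is killed;
        // the command-line equivalent would be -D mapred.task.timeout=21600000.
        conf.setLong("mapred.task.timeout", 6L * 60 * 60 * 1000);
        // Credentials for the s3n:// filesystem (placeholders).
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

        int rc = ToolRunner.run(conf, new DistCp(conf), new String[] {
                "hdfs:///data/to-upload",        // HDFS source directory (placeholder)
                "s3n://my-bucket/backup/" });    // S3 destination (placeholder)
        System.exit(rc);
    }
}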

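And a similarly hedged sketch of the s3distcp workaround that the top of the thread reports using: invoking Amazon's S3DistCp tool from a local CDH4 cluster, assuming Amazon's s3distcp jar (plus the separately packaged ProgressableResettableBufferedFileInputStream class mentioned above) is on the job classpath. The package name, credentials, bucket and paths below are assumptions for illustration only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
// Package/class name as shipped inside Amazon's s3distcp.jar (assumption; check the jar).
import com.amazon.external.elasticmapreduce.s3distcp.S3DistCp;

public class LocalS3DistCpRunner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Credentials for the s3n:// filesystem (placeholders).
        conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
        conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

        int rc = ToolRunner.run(conf, new S3DistCp(), new String[] {
                "--src", "hdfs:///data/to-upload/",      // local HDFS source (placeholder)
                "--dest", "s3n://my-bucket/backup/",     // S3 destination (placeholder)
                "--s3Endpoint", "s3.amazonaws.com" });
        System.exit(rc);
    }
}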