CDH4 can be either 1.x or 2.x Hadoop; are you using the 2.x line? I've used it 
primarily with 1.0.3, which is what AWS uses, so I presume that's what it's 
tested on.
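If you're not sure, running "hadoop version" on the cluster will tell you which line you're on.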

Himanish Kushary <[email protected]> wrote:

>Thanks Dave.
>
>
>I had already tried using the s3distcp jar, but got stuck on the error below, 
>which made me think that this is something specific to the Amazon Hadoop 
>distribution.
>
>
>Exception in thread "Thread-28" java.lang.NoClassDefFoundError: 
>org/apache/hadoop/fs/s3native/ProgressableResettableBufferedFileInputStream 
>
>
>Also, I noticed that the Amazon EMR hadoop-core.jar has this class, but it is 
>not present in the CDH4 (my local env) Hadoop jars.
>
>
>Could you suggest how I could get around this issue? One option could be using 
>the Amazon-specific jars, but then I would probably need to pull in all of 
>their jars (otherwise it could cause version mismatch errors for HDFS, 
>NoSuchMethodError and the like). 
>
>
>Appreciate your help regarding this.
>
>
>- Himanish
>
>
>
>
>On Fri, Mar 29, 2013 at 1:41 AM, David Parks <[email protected]> wrote:
>
>None of that complexity is needed; they distribute the jar publicly (not the 
>source, but the jar). You can just add this to your libjars: 
>s3n://region.elasticmapreduce/libs/s3distcp/latest/s3distcp.jar
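>
>For example (the jar and class names here are just placeholders for your own 
>job), a launch along the lines of "hadoop jar your-job.jar your.package.YourTool 
>-libjars s3n://region.elasticmapreduce/libs/s3distcp/latest/s3distcp.jar ..." 
>should pick it up, or you can download the jar first and point -libjars at a 
>local copy.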
>
> 
>
>No VPN or anything; if you can access the internet you can get to S3. 
>
> 
>
>Follow their docs here: 
>http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
>
> 
>
>Doesn’t matter where your Hadoop instance is running.
>
> 
>
>Here’s an example of the code/parameters I used to run it from within another 
>Tool. S3DistCp is itself a Tool, so it’s actually designed to be run from the 
>Hadoop command line normally.
>
> 
>
>       ToolRunner.run(getConf(), new S3DistCp(), new String[] {
>              "--src",             "/frugg/image-cache-stage2/",
>              "--srcPattern",      ".*part.*",
>              "--dest",            "s3n://fruggmapreduce/results-" + env + "/" + JobUtils.isoDate + "/output/itemtable/",
>              "--s3Endpoint",      "s3.amazonaws.com" });
>
> 
>
>Watch the “srcPattern”: make sure you have that leading `.*`. That one threw 
>me for a loop once.
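>
>If it helps, here’s roughly what the whole thing looks like as a standalone 
>Tool. Treat it as a sketch: the class name and paths are illustrative, and the 
>S3DistCp import is an assumption (check the package inside the s3distcp.jar 
>you download, it can differ between versions).
>
>       import org.apache.hadoop.conf.Configuration;
>       import org.apache.hadoop.conf.Configured;
>       import org.apache.hadoop.util.Tool;
>       import org.apache.hadoop.util.ToolRunner;
>       // Assumed package for the EMR jar; verify against the jar you pulled down.
>       import com.amazon.external.elasticmapreduce.s3distcp.S3DistCp;
>
>       public class CopyToS3 extends Configured implements Tool {
>           @Override
>           public int run(String[] ignored) throws Exception {
>               // S3DistCp is itself a Tool, so just delegate to it with the same
>               // arguments you would otherwise pass on the command line.
>               return ToolRunner.run(getConf(), new S3DistCp(), new String[] {
>                       "--src",        "/frugg/image-cache-stage2/",
>                       "--srcPattern", ".*part.*",
>                       "--dest",       "s3n://fruggmapreduce/results/output/itemtable/",
>                       "--s3Endpoint", "s3.amazonaws.com" });
>           }
>
>           public static void main(String[] args) throws Exception {
>               System.exit(ToolRunner.run(new Configuration(), new CopyToS3(), args));
>           }
>       }
>
>You then launch it like any other Tool with hadoop jar, with the s3distcp jar 
>on the classpath via -libjars as above.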
>
> 
>
>Dave
>
> 
>
> 
>
>From: Himanish Kushary [mailto:[email protected]] 
>Sent: Thursday, March 28, 2013 5:51 PM
>To: [email protected]
>Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> 
>
>Hi Dave,
>
> 
>
>Thanks for your reply. Our Hadoop instance is inside our corporate LAN. Could 
>you please provide some details on how I could use s3distcp from Amazon to 
>transfer data from our on-premises Hadoop to Amazon S3? Wouldn't some kind of 
>VPN be needed between the Amazon EMR instance and our on-premises Hadoop 
>instance? Did you mean using the jar from Amazon on our local server?
>
> 
>
>Thanks
>
>On Thu, Mar 28, 2013 at 3:56 AM, David Parks <[email protected]> wrote:
>
>Have you tried using s3distcp from Amazon? I have used it many times to transfer 
>1.5 TB between S3 and Hadoop instances. The process took 45 min, well over the 
>10-minute timeout period you’re running into a problem with.
>
> 
>
>Dave
>
> 
>
> 
>
>From: Himanish Kushary [mailto:[email protected]] 
>Sent: Thursday, March 28, 2013 10:54 AM
>To: [email protected]
>Subject: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> 
>
>Hello,
>
> 
>
>I am trying to transfer around 70 GB of files from HDFS to Amazon S3 using the 
>distcp utility. There are around 2,200 files distributed over 15 
>directories. The max individual file size is approx 50 MB.
>
> 
>
>The distcp MapReduce job keeps failing with this error: 
>
> 
>
>"Task attempt_201303211242_0260_m_000005_0 failed to report status for 600 
>seconds. Killing!"  
>
> 
>
>and in the task attempt logs I can see a lot of INFO messages like: 
>
> 
>
>"INFO org.apache.commons.httpclient.HttpMethodDirector: I/O exception 
>(java.io.IOException) caught when processing request: Resetting to invalid 
>mark"
>
> 
>
>As a workaround I am thinking of either transferring individual folders instead 
>of the entire 70 GB, or, as another option, increasing the 
>"mapred.task.timeout" parameter to something like 6-7 hours (as the avg rate 
>of transfer to S3 seems to be 5 MB/s). Is there any better option to 
>increase the throughput for transferring bulk data from HDFS to S3? Looking 
>forward to suggestions.
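>
>(For the timeout option, that would mean passing something like 
>-Dmapred.task.timeout=25200000 to distcp, since the parameter is in 
>milliseconds and 7 hours is 25,200,000 ms.)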
>
> 
>
> 
>
>-- 
>Thanks & Regards
>Himanish 
>
>
>
> 
>
>-- 
>Thanks & Regards
>Himanish 
>
>
>
>
>-- 
>Thanks & Regards
>Himanish 
>
