Thanks, Dave. I had already tried using the s3distcp jar, but got stuck on the error below, which made me think that this is something specific to the Amazon Hadoop distribution:
Exception in thread "Thread-28" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/s3native/ProgressableResettableBufferedFileInputStream

Also, I noticed that the Amazon EMR hadoop-core.jar has this class, but it is not present in the CDH4 (my local environment) Hadoop jars. Could you suggest how I could get around this issue? One option could be using the Amazon-specific jars, but then I would probably need to pull in all of them, since mixing them could otherwise cause version-mismatch errors against HDFS (NoSuchMethodError, etc.).
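For context, here is roughly how I am invoking it on our side; it is the same ToolRunner pattern from your mail below, just with an HDFS path as the source. The paths and bucket name here are placeholders rather than our real ones, so treat this as a sketch of the invocation:

    // Sketch only: placeholder paths/bucket, mirroring the ToolRunner pattern
    // from the mail below (S3DistCp is the Tool class from Amazon's s3distcp.jar).
    // The NoClassDefFoundError above shows up when this runs against CDH4.
    ToolRunner.run(getConf(), new S3DistCp(), new String[] {
            "--src",        "hdfs:///path/to/source/",
            "--srcPattern", ".*part.*",
            "--dest",       "s3n://my-bucket/backup/",
            "--s3Endpoint", "s3.amazonaws.com" });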
Appreciate your help regarding this.

- Himanish

On Fri, Mar 29, 2013 at 1:41 AM, David Parks <[email protected]> wrote:

> None of that complexity, they distribute the jar publicly (not the source,
> but the jar). You can just add this to your libjars:
> s3n://*region*.elasticmapreduce/libs/s3distcp/*latest*/s3distcp.jar
>
> No VPN or anything; if you can access the internet you can get to S3.
>
> Follow their docs here:
> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
>
> It doesn't matter where your Hadoop instance is running.
>
> Here's an example of the code/parameters I used to run it from within
> another Tool; it's a Tool itself, so it's actually designed to be run from
> the Hadoop command line normally.
>
>     ToolRunner.run(getConf(), new S3DistCp(), new String[] {
>             "--src",        "/frugg/image-cache-stage2/",
>             "--srcPattern", ".*part.*",
>             "--dest",       "s3n://fruggmapreduce/results-" + env + "/" + JobUtils.isoDate + "/output/itemtable/",
>             "--s3Endpoint", "s3.amazonaws.com" });
>
> Watch the "srcPattern": make sure you have that leading `.*`, that one
> threw me for a loop once.
>
> Dave
>
> From: Himanish Kushary [mailto:[email protected]]
> Sent: Thursday, March 28, 2013 5:51 PM
> To: [email protected]
> Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Hi Dave,
>
> Thanks for your reply. Our Hadoop instance is inside our corporate LAN.
> Could you please provide some details on how I could use s3distcp from
> Amazon to transfer data from our on-premises Hadoop to Amazon S3?
> Wouldn't some kind of VPN be needed between the Amazon EMR instance and
> our on-premises Hadoop instance? Did you mean use the jar from Amazon on
> our local server?
>
> Thanks
>
> On Thu, Mar 28, 2013 at 3:56 AM, David Parks <[email protected]> wrote:
>
> Have you tried using s3distcp from Amazon? I used it many times to
> transfer 1.5 TB between S3 and Hadoop instances. The process took 45 min,
> well over the 10-minute timeout period you're running into a problem with.
>
> Dave
>
> From: Himanish Kushary [mailto:[email protected]]
> Sent: Thursday, March 28, 2013 10:54 AM
> To: [email protected]
> Subject: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput
>
> Hello,
>
> I am trying to transfer around 70 GB of files from HDFS to Amazon S3 using
> the distcp utility. There are around 2,200 files distributed over 15
> directories. The maximum individual file size is approximately 50 MB.
>
> The distcp mapreduce job keeps failing with this error:
>
> "Task attempt_201303211242_0260_m_000005_0 failed to report status for
> 600 seconds. Killing!"
>
> and in the task attempt logs I can see a lot of INFO messages like:
>
> "INFO org.apache.commons.httpclient.HttpMethodDirector: I/O exception
> (java.io.IOException) caught when processing request: Resetting to invalid
> mark"
>
> As a workaround I am thinking of either transferring individual folders
> instead of the entire 70 GB, or increasing the "mapred.task.timeout"
> parameter to something like 6-7 hours (as the average rate of transfer to
> S3 seems to be 5 MB/s). Is there any other, better option to increase the
> throughput for transferring bulk data from HDFS to S3? Looking forward to
> suggestions.
>
> --
> Thanks & Regards
> Himanish
>
> --
> Thanks & Regards
> Himanish

--
Thanks & Regards
Himanish
