RE: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

David Parks Sat, 30 Mar 2013 18:27:25 -0700

Transfer rates of 4-20 MB/sec are common from S3 to *one* local AWS box. This
was, of course, a cluster, and s3distcp is specifically designed to take
advantage of the cluster, so it was a 45 minute job to transfer the 1.5 TB
to the full cluster of (I forget how many servers I had at the time) maybe
15-30 m1.xlarge instances. The numbers are rough; I could be mistaken and it
took 1 ½ hours to do the transfer (though I recall 45 min). In either case the
s3distcp job ran longer than the task timeout period, which was the real
point I was focusing on.
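
As a rough sanity check on those figures, the sketch below works out the aggregate and per-node rates for 1.5 TB moved in 45 or 90 minutes by 15 or 30 nodes. The class and method names are made up for illustration, and it assumes 1 TB = 1024 * 1024 MB and an even spread of work across nodes, neither of which the thread confirms.

```java
public class ThroughputEstimate {
    /** Aggregate rate in MB/s for the given volume and duration (assumes 1 TB = 1024 * 1024 MB). */
    static double aggregateMBps(double terabytes, double minutes) {
        return terabytes * 1024 * 1024 / (minutes * 60);
    }

    /** Per-node share of the aggregate rate, assuming the work is spread evenly. */
    static double perNodeMBps(double aggregate, int nodes) {
        return aggregate / nodes;
    }

    public static void main(String[] args) {
        // Both durations mentioned above: the recalled 45 min and the possible 1.5 hours.
        for (double minutes : new double[] {45, 90}) {
            double total = aggregateMBps(1.5, minutes);
            System.out.printf("%.0f min: %.0f MB/s total, %.1f MB/s per node (15 nodes), %.1f (30 nodes)%n",
                    minutes, total, perNodeMBps(total, 15), perNodeMBps(total, 30));
        }
    }
}
```

Under those assumptions the 45 minute case comes to roughly 583 MB/s in aggregate, or about 19 to 39 MB/s per node, so the cluster-wide figure is not out of line with the single-box rates quoted above.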


 

I seem to recall needing to re-package their jar as well, but for different
reasons: they package in some other open source utilities and I had version
conflicts, so you might want to watch for that.

 

I've never seen this ProgressableResettableBufferedFileInputStream, so I
can't offer much more advice on that one.

 

Good luck! Let us know how it turns out.

Dave

 

 

From: Himanish Kushary [mailto:[email protected]] 
Sent: Friday, March 29, 2013 9:57 PM
To: [email protected]
Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

 

Yes, you are right, CDH4 is the 2.x line, but I even checked the javadocs
for the 1.0.4 branch (I could not find the 1.0.3 APIs, so I used
http://hadoop.apache.org/docs/r1.0.4/api/index.html) and did not find the
"ProgressableResettableBufferedFileInputStream" class. I am not sure how it
comes to be present in the hadoop-core.jar in Amazon EMR.

 

In the meantime I have come up with a dirty workaround: extracting the
class from the Amazon jar and packaging it into its own separate jar. I am
now actually able to run s3distcp on local CDH4 using Amazon's jar and
transfer from my local Hadoop to Amazon S3.
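
For anyone wanting to repeat that workaround, here is a minimal sketch of copying selected entries from one jar into a new one using only java.util.jar. The class name `ClassExtractor`, the method `extract`, and the command-line arguments are hypothetical, and it assumes the missing class has no further dependencies beyond its own nested classes, which the thread does not establish.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;

public class ClassExtractor {
    /**
     * Copies every non-directory entry whose name starts with entryPrefix from
     * sourceJar into a new jar at targetJar. Returns the number of entries copied.
     */
    static int extract(Path sourceJar, Path targetJar, String entryPrefix) throws IOException {
        int copied = 0;
        try (JarFile in = new JarFile(sourceJar.toFile());
             OutputStream fileOut = Files.newOutputStream(targetJar);
             JarOutputStream out = new JarOutputStream(fileOut)) {
            Enumeration<JarEntry> entries = in.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                if (entry.isDirectory() || !entry.getName().startsWith(entryPrefix)) {
                    continue;
                }
                // Start a fresh entry so sizes and checksums are recomputed on write.
                out.putNextEntry(new JarEntry(entry.getName()));
                try (InputStream data = in.getInputStream(entry)) {
                    byte[] buffer = new byte[8192];
                    int read;
                    while ((read = data.read(buffer)) != -1) {
                        out.write(buffer, 0, read);
                    }
                }
                out.closeEntry();
                copied++;
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ClassExtractor <source.jar> <target.jar>");
            return;
        }
        // The prefix has no ".class" suffix so nested classes ("...$1.class") are copied too.
        String prefix = "org/apache/hadoop/fs/s3native/ProgressableResettableBufferedFileInputStream";
        int copied = extract(Paths.get(args[0]), Paths.get(args[1]), prefix);
        System.out.println("Copied " + copied + " entries");
    }
}
```

The resulting jar would then be put on the job's classpath next to the s3distcp jar. If the class turns out to need other EMR-only classes, the same prefix approach can be repeated for them.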

 

But the real issue is the throughput. You mentioned that you had transferred
1.5 TB in 45 mins, which comes to around 583 MB/s. I am barely getting 4 MB/s
upload speed! How did you get more than 100 times my speed? Could you
please share any settings/tweaks that you may have made to achieve this?
Were you on some very specific high-bandwidth network? Was it between HDFS
on EC2 and Amazon S3?

 

Looking forward to hearing from you.

 

Thanks

Himanish

 

On Fri, Mar 29, 2013 at 10:34 AM, David Parks <[email protected]>
wrote:

CDH4 can be either 1.x or 2.x Hadoop; are you using the 2.x line? I've used
it primarily with 1.0.3, which is what AWS uses, so I presume that's what
it's tested on.



Himanish Kushary <[email protected]> wrote:

Thanks Dave.

 

I had already tried using the s3distcp jar, but got stuck on the error
below, which made me think that this is something specific to the Amazon
Hadoop distribution.

 

Exception in thread "Thread-28" java.lang.NoClassDefFoundError:
org/apache/hadoop/fs/s3native/ProgressableResettableBufferedFileInputStream 

 

Also, I noticed that the Amazon EMR hadoop-core.jar has this class, but it is
not present in the CDH4 (my local env) Hadoop jars.

 

Could you suggest how I could get around this issue? One option could be
using the Amazon-specific jars, but then I would probably need to get all the
jars (otherwise it could cause version mismatch errors for HDFS, such as
NoSuchMethodError).

 

Appreciate your help regarding this.

 

- Himanish

 

 

On Fri, Mar 29, 2013 at 1:41 AM, David Parks <[email protected]> wrote:

None of that complexity is needed: they distribute the jar publicly (not the
source, but the jar). You can just add this to your libjars:
s3n://region.elasticmapreduce/libs/s3distcp/latest/s3distcp.jar

 

No VPN or anything, if you can access the internet you can get to S3. 

 

Follow their docs here:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html

 

It doesn't matter where your Hadoop instance is running.

 

Here's an example of code/parameters I used to run it from within another
Tool. It's a Tool itself, so it's actually designed to run from the Hadoop
command line normally.

 

       ToolRunner.run(getConf(), new S3DistCp(), new String[] {
              "--src",        "/frugg/image-cache-stage2/",
              "--srcPattern", ".*part.*",
              "--dest",       "s3n://fruggmapreduce/results-" + env + "/"
                                  + JobUtils.isoDate + "/output/itemtable/",
              "--s3Endpoint", "s3.amazonaws.com" });

 

Watch the srcPattern: make sure you have that leading `.*`. That one threw
me for a loop once.
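
One plausible reason the leading `.*` matters, sketched below with plain java.util.regex: if the pattern has to match the whole path rather than just part of it, then `part.*` alone can never match a path that begins with a directory name. That S3DistCp applies --srcPattern this way is an assumption here, not something the thread confirms, and the file name `part-r-00000` and the class `SrcPatternCheck` are hypothetical.

```java
import java.util.regex.Pattern;

public class SrcPatternCheck {
    /** Full-match semantics: the regex must cover the entire path string. */
    static boolean fullMatch(String regex, String path) {
        return Pattern.compile(regex).matcher(path).matches();
    }

    /** Substring semantics, for contrast: the regex may match anywhere in the path. */
    static boolean partialMatch(String regex, String path) {
        return Pattern.compile(regex).matcher(path).find();
    }

    public static void main(String[] args) {
        // Hypothetical output file under the --src directory from the example above.
        String path = "/frugg/image-cache-stage2/part-r-00000";

        System.out.println(fullMatch("part.*", path));    // false: nothing covers the directory prefix
        System.out.println(fullMatch(".*part.*", path));  // true: the leading .* absorbs the prefix
        System.out.println(partialMatch("part.*", path)); // true: find() accepts a substring match
    }
}
```

A pattern written without the leading `.*` would, under full-match semantics, select no files at all, so the copy could finish without moving anything.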

 

Dave

 

 

From: Himanish Kushary [mailto:[email protected]] 
Sent: Thursday, March 28, 2013 5:51 PM
To: [email protected]
Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

 

Hi Dave,

 

Thanks for your reply. Our Hadoop instance is inside our corporate LAN. Could
you please provide some details on how I could use s3distcp from Amazon
to transfer data from our on-premises Hadoop to Amazon S3? Wouldn't some
kind of VPN be needed between the Amazon EMR instance and our on-premises
Hadoop instance? Did you mean using the jar from Amazon on our local server?

 

Thanks

On Thu, Mar 28, 2013 at 3:56 AM, David Parks <[email protected]> wrote:

Have you tried using s3distcp from Amazon? I used it many times to transfer
1.5 TB between S3 and Hadoop instances. The process took 45 min, well over
the 10 min timeout period you're running into a problem with.

 

Dave

 

 

From: Himanish Kushary [mailto:[email protected]] 
Sent: Thursday, March 28, 2013 10:54 AM
To: [email protected]
Subject: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

 

Hello,

 

I am trying to transfer around 70 GB of files from HDFS to Amazon S3 using
the distcp utility. There are around 2200 files distributed over 15
directories. The maximum individual file size is approx. 50 MB.

 

The distcp MapReduce job keeps failing with this error:

 

"Task attempt_201303211242_0260_m_000005_0 failed to report status for 600
seconds. Killing!"  

 

and in the task attempt logs I can see a lot of INFO messages like

 

"INFO org.apache.commons.httpclient.HttpMethodDirector: I/O exception
(java.io.IOException) caught when processing request: Resetting to invalid
mark"

 

As a workaround I am thinking of either transferring individual folders
instead of the entire 70 GB at once or, as another option, increasing the
"mapred.task.timeout" parameter to something like 6-7 hours (as the average
rate of transfer to S3 seems to be 5 MB/s). Is there any better option to
increase the throughput for transferring bulk data from HDFS to S3?
Looking forward to your suggestions.
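
To make the timeout arithmetic concrete, here is a small sketch that estimates how long 70 GB takes at 5 MB/s and converts an hour figure into milliseconds, the unit mapred.task.timeout is expressed in (the 600 seconds in the error above correspond to 600000 ms). The class and method names are illustrative only, and the estimate assumes all the data moves through a single 5 MB/s stream with 1 GB = 1024 MB.

```java
public class TaskTimeoutEstimate {
    /** Seconds needed to move the given volume at a steady rate (assumes 1 GB = 1024 MB). */
    static long transferSeconds(double gigabytes, double mbPerSecond) {
        return Math.round(gigabytes * 1024 / mbPerSecond);
    }

    /** Converts hours to milliseconds, the unit used by mapred.task.timeout. */
    static long toTimeoutMillis(double hours) {
        return Math.round(hours * 3600 * 1000);
    }

    public static void main(String[] args) {
        long seconds = transferSeconds(70, 5);
        System.out.printf("70 GB at 5 MB/s: %d s (about %.1f hours)%n", seconds, seconds / 3600.0);
        System.out.println("mapred.task.timeout for 6 hours: " + toTimeoutMillis(6) + " ms");
        System.out.println("mapred.task.timeout for 7 hours: " + toTimeoutMillis(7) + " ms");
    }
}
```

On those assumptions the whole transfer takes about four hours, so a 6-7 hour timeout leaves some headroom, though it only hides the stalled tasks rather than speeding up the transfer.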

 

 

-- 
Thanks & Regards
Himanish 





 
