CDH4 can be either the 1.x or 2.x Hadoop line; are you using the 2.x line? I've used it primarily with 1.0.3, which is what AWS uses, so I presume that's what it's tested on.
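If you're not sure which line your cluster is on, running "hadoop version" on a cluster node will tell you; the same information is also available programmatically through Hadoop's VersionInfo class. A minimal sketch (the class ships with Hadoop core and is not CDH-specific):

    // Prints the version of the Hadoop jars on the classpath, e.g. "1.0.3"
    // for the 1.x line or a "2.0.0-cdh4.x" style string for CDH4's 2.x line.
    import org.apache.hadoop.util.VersionInfo;

    public class PrintHadoopVersion {
        public static void main(String[] args) {
            System.out.println(VersionInfo.getVersion());
        }
    }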
Himanish Kushary <[email protected]> wrote:

Thanks Dave.

I had already tried using the s3distcp jar, but got stuck on the error below, which made me think that this is something specific to the Amazon Hadoop distribution:

Exception in thread "Thread-28" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/s3native/ProgressableResettableBufferedFileInputStream

Also, I noticed that the Amazon EMR hadoop-core.jar has this class, but it is not present in the CDH4 (my local env) Hadoop jars.

Could you suggest how I could get around this issue? One option could be using the Amazon-specific jars, but then I would probably need to get all of their jars (otherwise it could cause version-mismatch errors against HDFS, NoSuchMethodError etc.).

Appreciate your help regarding this.

- Himanish

On Fri, Mar 29, 2013 at 1:41 AM, David Parks <[email protected]> wrote:

None of that complexity: they distribute the jar publicly (not the source, but the jar). You can just add this to your libjars:

s3n://region.elasticmapreduce/libs/s3distcp/latest/s3distcp.jar

No VPN or anything; if you can access the internet, you can get to S3.

Follow their docs here:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html

It doesn't matter where your Hadoop instance is running.

Here's an example of the code/parameters I used to run it from within another Tool. It's a Tool, so it's actually designed to run from the Hadoop command line normally:

    ToolRunner.run(getConf(), new S3DistCp(), new String[] {
        "--src", "/frugg/image-cache-stage2/",
        "--srcPattern", ".*part.*",
        "--dest", "s3n://fruggmapreduce/results-" + env + "/" + JobUtils.isoDate + "/output/itemtable/",
        "--s3Endpoint", "s3.amazonaws.com" });

Watch the "srcPattern": make sure you have that leading `.*`, that one threw me for a loop once.

Dave
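For anyone who wants to run the above standalone rather than from inside another Tool, a self-contained sketch follows. The import path for S3DistCp is an assumption (it has varied across s3distcp releases, so verify with "jar tf s3distcp.jar"), and the destination bucket is a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;
    // Assumed package; verify against the s3distcp jar you actually downloaded.
    import com.amazon.elasticmapreduce.s3distcp.S3DistCp;

    public class CopyToS3 {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            int rc = ToolRunner.run(conf, new S3DistCp(), new String[] {
                "--src", "/frugg/image-cache-stage2/",         // HDFS source directory
                "--srcPattern", ".*part.*",                    // note the leading .*
                "--dest", "s3n://my-bucket/output/itemtable/", // placeholder bucket/path
                "--s3Endpoint", "s3.amazonaws.com" });
            System.exit(rc);
        }
    }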
From: Himanish Kushary [mailto:[email protected]]
Sent: Thursday, March 28, 2013 5:51 PM
To: [email protected]
Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

Hi Dave,

Thanks for your reply. Our Hadoop instance is inside our corporate LAN. Could you please provide some details on how I could use s3distcp from Amazon to transfer data from our on-premises Hadoop to Amazon S3? Wouldn't some kind of VPN be needed between the Amazon EMR instance and our on-premises Hadoop instance? Did you mean use the jar from Amazon on our local server?

Thanks

On Thu, Mar 28, 2013 at 3:56 AM, David Parks <[email protected]> wrote:

Have you tried using s3distcp from Amazon? I used it many times to transfer 1.5 TB between S3 and Hadoop instances. The process took 45 min, well over the 10-minute timeout period you're running into a problem with.

Dave

From: Himanish Kushary [mailto:[email protected]]
Sent: Thursday, March 28, 2013 10:54 AM
To: [email protected]
Subject: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

Hello,

I am trying to transfer around 70 GB of files from HDFS to Amazon S3 using the distcp utility. There are around 2200 files distributed over 15 directories. The max individual file size is approx 50 MB.

The distcp mapreduce job keeps failing with this error:

"Task attempt_201303211242_0260_m_000005_0 failed to report status for 600 seconds. Killing!"

and in the task attempt logs I can see a lot of INFO messages like:

"INFO org.apache.commons.httpclient.HttpMethodDirector: I/O exception (java.io.IOException) caught when processing request: Resetting to invalid mark"

As a workaround I am thinking of either transferring individual folders instead of the entire 70 GB at once, or increasing the "mapred.task.timeout" parameter to something like 6-7 hours (as the average transfer rate to S3 seems to be 5 MB/s). Is there any better option to increase the throughput for transferring bulk data from HDFS to S3? Looking forward to suggestions.

--
Thanks & Regards
Himanish
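On the timeout workaround: mapred.task.timeout is specified in milliseconds, so 6 hours is 21600000 (on the 2.x/MRv2 line the property was renamed mapreduce.task.timeout). Since distcp runs through ToolRunner, the value can be passed on the command line via the generic -D option, e.g. "hadoop distcp -Dmapred.task.timeout=21600000 ...", or set on the Configuration when driving the copy from code. A minimal sketch, reusing the assumed S3DistCp import from the earlier example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;
    // Assumed package, as in the earlier sketch; verify against your s3distcp jar.
    import com.amazon.elasticmapreduce.s3distcp.S3DistCp;

    public class CopyWithLongTimeout {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Tasks that report no status for longer than this (in ms) are killed;
            // 6 hours = 6 * 60 * 60 * 1000 ms.
            conf.setLong("mapred.task.timeout", 6L * 60 * 60 * 1000);
            System.exit(ToolRunner.run(conf, new S3DistCp(), args));
        }
    }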
