Sato,

Also, we did see a different error entirely when we didn't set fs.s3n.impl,
but now that we have it working I can try removing that property in
development to verify.
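That verification could be sketched roughly like this (a hypothetical sketch only; the install path and bucket name are placeholders, not our real values):

```shell
# Hypothetical sketch; /usr/local/hadoop and my-s3-bucket are placeholders.
# 1) Keep hadoop-aws.jar (under share/hadoop/tools/lib) on the classpath:
export HADOOP_CLASSPATH="${HADOOP_CLASSPATH:-}:/usr/local/hadoop/share/hadoop/tools/lib/*"
# 2) With fs.s3n.impl removed from core-site.xml, re-run a listing; if
#    FileSystem service discovery works, this should still succeed:
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -ls s3n://my-s3-bucket/
fi
```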
But the "it has never been done in previous versions" point is irrelevant,
IMO. This was a big change, and that behavior certainly could have changed;
though if you're looking at the code, then I'm likely wrong.

William Watson
Software Engineer
(904) 705-7056 PCS

On Wed, Apr 22, 2015 at 9:02 AM, Billy Watson <[email protected]> wrote:

> Chris and Sato,
>
> Thanks a bunch! I've been so swamped by these and other issues we've been
> having while scrambling to upgrade our cluster that I forgot to file a
> bug. I certainly complained aloud that the docs were insufficient, but I
> didn't do anything to help the community, so thanks a bunch for
> recognizing that and helping me out!
>
> William Watson
> Software Engineer
> (904) 705-7056 PCS
>
> On Wed, Apr 22, 2015 at 3:06 AM, Takenori Sato <[email protected]> wrote:
>
>> Hi Billy, Chris,
>>
>> Let me share a couple of my findings.
>>
>> I believe this was introduced by HADOOP-10893, which went in as of
>> 2.6.0 (HDP 2.2).
>>
>> 1. fs.s3n.impl
>>
>> > We added a property to the core-site.xml file:
>>
>> You don't need to set this explicitly. It was never required in
>> previous versions.
>>
>> Take a look at FileSystem#loadFileSystems, which is called from
>> FileSystem#getFileSystemClass. Subclasses of FileSystem are loaded
>> automatically if they are available on the classloader you care about.
>>
>> So you just need to make sure hadoop-aws.jar is on the classpath.
>>
>> For the file system shell, this is done in hadoop-env.sh, while for an
>> MR job it is mapreduce.application.classpath, and for YARN,
>> yarn.application.classpath.
>>
>> 2. mapreduce.application.classpath
>>
>> > And updated the classpath for mapreduce applications:
>>
>> Note that it points to a distributed cache on the default HDP 2.2
>> distribution.
>>
>> <property>
>>   <name>mapreduce.application.classpath</name>
>>   <value>$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure</value>
>> </property>
>>
>> * $PWD/mr-framework/hadoop/share/hadoop/tools/lib/* contains
>> hadoop-aws.jar (S3NFileSystem)
>>
>> While on vanilla Hadoop, it uses the standard paths, like yours.
>>
>> <property>
>>   <name>mapreduce.application.classpath</name>
>>   <value>/hadoop-2.6.0/etc/hadoop:/hadoop-2.6.0/share/hadoop/common/lib/*:/hadoop-2.6.0/share/hadoop/common/*:/hadoop-2.6.0/share/hadoop/hdfs:/hadoop-2.6.0/share/hadoop/hdfs/lib/*:/hadoop-2.6.0/share/hadoop/hdfs/*:/hadoop-2.6.0/share/hadoop/yarn/lib/*:/hadoop-2.6.0/share/hadoop/yarn/*:/hadoop-2.6.0/share/hadoop/mapreduce/lib/*:/hadoop-2.6.0/share/hadoop/mapreduce/*:/hadoop-2.6.0/contrib/capacity-scheduler/*.jar:/hadoop-2.6.0/share/hadoop/tools/lib/*</value>
>> </property>
>>
>> Thanks,
>> Sato
>>
>> On Wed, Apr 22, 2015 at 3:10 PM, Chris Nauroth <[email protected]>
>> wrote:
>>
>>> Hello Billy,
>>>
>>> I think your experience indicates that our documentation is
>>> insufficient for discussing how to configure and use the alternative
>>> file systems. I filed issue HADOOP-11863 to track a documentation
>>> enhancement.
>>>
>>> https://issues.apache.org/jira/browse/HADOOP-11863
>>>
>>> Please feel free to watch that issue if you'd like to be informed as
>>> it makes progress. Thank you for reporting back to the thread after
>>> you had a solution.
>>>
>>> Chris Nauroth
>>> Hortonworks
>>> http://hortonworks.com/
>>>
>>> From: Billy Watson <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Monday, April 20, 2015 at 11:14 AM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: Unable to Find S3N Filesystem Hadoop 2.6
>>>
>>> We found the correct configs.
>>>
>>> This post was helpful, but didn't entirely work for us out of the box,
>>> since we are running Hadoop in pseudo-distributed mode.
>>> http://hortonworks.com/community/forums/topic/s3n-error-for-hdp-2-2/
>>>
>>> We added a property to the core-site.xml file:
>>>
>>> <property>
>>>   <name>fs.s3n.impl</name>
>>>   <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
>>>   <description>Tell hadoop which class to use to access s3 URLs. This
>>>   change became necessary in hadoop 2.6.0</description>
>>> </property>
>>>
>>> And updated the classpath for mapreduce applications:
>>>
>>> <property>
>>>   <name>mapreduce.application.classpath</name>
>>>   <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_MAPRED_HOME/share/hadoop/tools/lib/*</value>
>>>   <description>The classpath specifically for mapreduce jobs. This
>>>   override is necessary so that s3n URLs work on hadoop 2.6.0+</description>
>>> </property>
>>>
>>> William Watson
>>> Software Engineer
>>> (904) 705-7056 PCS
>>>
>>> On Mon, Apr 20, 2015 at 11:13 AM, Billy Watson <[email protected]>
>>> wrote:
>>>
>>>> Thanks, anyway. Anyone else run into this issue?
>>>>
>>>> William Watson
>>>> Software Engineer
>>>> (904) 705-7056 PCS
>>>>
>>>> On Mon, Apr 20, 2015 at 11:11 AM, Jonathan Aquilina <
>>>> [email protected]> wrote:
>>>>
>>>>> Sadly I'll have to pull back; I have only run a Hadoop MapReduce
>>>>> cluster with Amazon EMR.
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On 20 Apr 2015, at 16:53, Billy Watson <[email protected]>
>>>>> wrote:
>>>>>
>>>>> This is an install on a CentOS 6 virtual machine used in our test
>>>>> environment. We use HDP in staging and production, and we discovered
>>>>> these issues while trying to build a new cluster using HDP 2.2,
>>>>> which upgrades from Hadoop 2.4 to Hadoop 2.6.
>>>>>
>>>>> William Watson
>>>>> Software Engineer
>>>>> (904) 705-7056 PCS
>>>>>
>>>>> On Mon, Apr 20, 2015 at 10:26 AM, Jonathan Aquilina <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> One thing I think I most likely missed completely: are you using an
>>>>>> Amazon EMR cluster or something in-house?
>>>>>>
>>>>>> ---
>>>>>> Regards,
>>>>>> Jonathan Aquilina
>>>>>> Founder Eagle Eye T
>>>>>>
>>>>>> On 2015-04-20 16:21, Billy Watson wrote:
>>>>>>
>>>>>> I appreciate the response. These JAR files aren't 3rd party.
>>>>>> They're included with the Hadoop distribution, but in Hadoop 2.6
>>>>>> they stopped being loaded by default and now they have to be loaded
>>>>>> manually, if needed.
>>>>>>
>>>>>> Essentially the problem boils down to:
>>>>>>
>>>>>> - need to access s3n URLs
>>>>>> - cannot access them without including the tools directory
>>>>>> - after including the tools directory in HADOOP_CLASSPATH, failures
>>>>>>   start happening later in the job
>>>>>> - need to find the right env variable (or shell script or whatever)
>>>>>>   to include jets3t & the other JARs needed to access s3n URLs (I
>>>>>>   think)
>>>>>>
>>>>>> William Watson
>>>>>> Software Engineer
>>>>>> (904) 705-7056 PCS
>>>>>>
>>>>>> On Mon, Apr 20, 2015 at 9:58 AM, Jonathan Aquilina <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> You mention an environment variable. In the step before you
>>>>>>> specify the steps to run to get to the result, you can specify a
>>>>>>> bash script that will let you put any 3rd party JAR files (for us
>>>>>>> we used ESRI) on the cluster and propagate them to all nodes in
>>>>>>> the cluster as well. You can ping me off-list if you need further
>>>>>>> help. Thing is, I haven't used Pig, but my boss and coworker wrote
>>>>>>> the mappers and reducers. Getting these JARs to the entire cluster
>>>>>>> was a super small and simple bash script.
>>>>>>>
>>>>>>> ---
>>>>>>> Regards,
>>>>>>> Jonathan Aquilina
>>>>>>> Founder Eagle Eye T
>>>>>>>
>>>>>>> On 2015-04-20 15:17, Billy Watson wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am able to run a `hadoop fs -ls s3n://my-s3-bucket` from the
>>>>>>> command line without issue. I have set some options in
>>>>>>> hadoop-env.sh to make sure all the S3 stuff for Hadoop 2.6 is set
>>>>>>> up correctly. (This was very confusing, BTW, and there is not
>>>>>>> enough searchable documentation on the changes to the S3 stuff in
>>>>>>> Hadoop 2.6, IMHO.)
>>>>>>>
>>>>>>> Anyway, when I run a Pig job which accesses S3, it gets to 16%,
>>>>>>> does not fail in Pig, but rather fails in mapreduce with "Error:
>>>>>>> java.io.IOException: No FileSystem for scheme: s3n."
>>>>>>>
>>>>>>> I have added [hadoop-install-loc]/lib and
>>>>>>> [hadoop-install-loc]/share/hadoop/tools/lib/ to the
>>>>>>> HADOOP_CLASSPATH env variable in hadoop-env.sh.erb. When I do not
>>>>>>> do this, the Pig job will fail at 0% (before it ever gets to
>>>>>>> mapreduce) with a very similar "No FileSystem for scheme: s3n"
>>>>>>> error.
>>>>>>>
>>>>>>> I feel like at this point I just have to add the
>>>>>>> share/hadoop/tools/lib directory (and maybe lib) to the right
>>>>>>> environment variable, but I can't figure out which environment
>>>>>>> variable that should be.
>>>>>>>
>>>>>>> I appreciate any help, thanks!!
>>>>>>>
>>>>>>> Stack trace:
>>>>>>> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
>>>>>>> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
>>>>>>> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
>>>>>>> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
>>>>>>> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
>>>>>>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
>>>>>>> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>>>>>>> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:498)
>>>>>>> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:467)
>>>>>>> at org.apache.pig.piggybank.storage.CSVExcelStorage.setLocation(CSVExcelStorage.java:609)
>>>>>>> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.mergeSplitSpecificConf(PigInputFormat.java:129)
>>>>>>> at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.createRecordReader(PigInputFormat.java:103)
>>>>>>> at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.<init>(MapTask.java:512)
>>>>>>> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:755)
>>>>>>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>>>>>>> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
>>>>>>> at java.security.AccessController.doPrivileged(Native Method)
>>>>>>> at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>>>>>>> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
>>>>>>>
>>>>>>> — Billy Watson
>>>>>>>
>>>>>>> --
>>>>>>> William Watson
>>>>>>> Software Engineer
>>>>>>> (904) 705-7056 PCS
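[Editor's note: the fix that emerged in the thread above can be condensed into the short sketch below. The install location is an assumption for a vanilla Hadoop 2.6 layout; adjust paths to your own setup.]

```shell
# Sketch only; HADOOP_HOME is an assumed install location.
HADOOP_HOME="${HADOOP_HOME:-/usr/local/hadoop-2.6.0}"

# 1) For the filesystem shell (hadoop fs -ls s3n://...), put the tools
#    jars, which include hadoop-aws.jar and jets3t, on HADOOP_CLASSPATH
#    in hadoop-env.sh:
export HADOOP_CLASSPATH="${HADOOP_CLASSPATH:-}:$HADOOP_HOME/share/hadoop/tools/lib/*"

# 2) For MR jobs, the same directory must also appear on
#    mapreduce.application.classpath (mapred-site.xml), e.g. by appending
#    $HADOOP_MAPRED_HOME/share/hadoop/tools/lib/* as shown in the thread.

# Per Sato's analysis, explicitly setting fs.s3n.impl should not be
# required: FileSystem implementations are discovered automatically once
# the jar is on the classpath.
```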
