[ https://issues.apache.org/jira/browse/HADOOP-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicholas Chammas resolved HADOOP-15559.
---------------------------------------
    Resolution: Fixed

> Clarity on Spark compatibility with hadoop-aws
> ----------------------------------------------
>
>                 Key: HADOOP-15559
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15559
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: documentation, fs/s3
>            Reporter: Nicholas Chammas
>            Priority: Minor
>
> I'm the maintainer of [Flintrock|https://github.com/nchammas/flintrock], a
> command-line tool for launching Apache Spark clusters on AWS. One of the
> things I try to do for my users is make it straightforward to use Spark
> with {{s3a://}}. I do this by recommending that users start Spark with the
> {{hadoop-aws}} package. For example:
> {code:java}
> pyspark --packages "org.apache.hadoop:hadoop-aws:2.8.4"
> {code}
> I'm struggling, however, to understand which versions of {{hadoop-aws}}
> should work with which versions of Spark.
> Spark releases are [built against Hadoop
> 2.7|http://archive.apache.org/dist/spark/spark-2.3.1/]. At the same time,
> I've been told that I should be able to use newer versions of Hadoop and
> Hadoop libraries with Spark, so, for example, running Spark built against
> Hadoop 2.7 alongside HDFS 2.8 should work, and there is [no need to build
> Spark explicitly against Hadoop
> 2.8|http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-3-1-RC4-tp24087p24092.html].
> I'm having trouble translating this mental model into recommendations for
> how to pair Spark with {{hadoop-aws}}.
> For example, Spark 2.3.1 built against Hadoop 2.7 works with
> {{hadoop-aws:2.7.6}} but not with {{hadoop-aws:2.8.4}}. Trying the latter
> yields the following error when I try to access files via {{s3a://}}:
> {code:java}
> py4j.protocol.Py4JJavaError: An error occurred while calling o35.text.
> : java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
> 	at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
> 	at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
> 	at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
> 	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
> 	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
> 	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> 	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> 	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> 	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> 	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> 	at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
> 	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
> 	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
> 	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
> 	at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:693)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> 	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> 	at py4j.Gateway.invoke(Gateway.java:282)
> 	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> 	at py4j.commands.CallCommand.execute(CallCommand.java:79)
> 	at py4j.GatewayConnection.run(GatewayConnection.java:238)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
> So it would seem that {{hadoop-aws}} must be matched to the same
> MAJOR.MINOR release of Hadoop that Spark is built against. However, neither
> [this page|https://wiki.apache.org/hadoop/AmazonS3] nor [this
> one|https://hortonworks.github.io/hdp-aws/s3-spark/] sheds any light on how
> to pair the correct version of {{hadoop-aws}} with Spark.
> Would it be appropriate to add some guidance somewhere on what versions of
> {{hadoop-aws}} work with what versions and builds of Spark? It would help
> eliminate this kind of guesswork and slow spelunking.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org
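[Editor's note] The pairing rule the report converges on — match {{hadoop-aws}} to the MAJOR.MINOR Hadoop version that the Spark distribution bundles — can be sketched as a small shell helper. This is not from the issue itself: the `hadoop_version_from_jar` function, the reliance on the `$SPARK_HOME/jars` layout, and the jar-name parsing are all illustrative assumptions.

```shell
#!/bin/sh
# Sketch (assumptions, not from the issue): infer the Hadoop version a Spark
# install was built against by inspecting the bundled hadoop-common jar, then
# launch pyspark with the matching hadoop-aws artifact.

hadoop_version_from_jar() {
    # e.g. ".../jars/hadoop-common-2.7.3.jar" -> "2.7.3"
    jar=$(basename "$1")
    version="${jar#hadoop-common-}"
    printf '%s\n' "${version%.jar}"
}

# Only attempt the launch when a Spark install with the standard jars/
# layout is actually present.
if [ -n "${SPARK_HOME:-}" ] && ls "$SPARK_HOME"/jars/hadoop-common-*.jar >/dev/null 2>&1; then
    spark_hadoop_version=$(hadoop_version_from_jar "$SPARK_HOME"/jars/hadoop-common-*.jar)
    exec pyspark --packages "org.apache.hadoop:hadoop-aws:${spark_hadoop_version}"
fi
```

With a Spark 2.3.1 distribution built against Hadoop 2.7 this would resolve to `hadoop-aws:2.7.x`, which is the combination the report found to work, rather than the 2.8.4 artifact that triggered the `IllegalAccessError` above.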