[ 
https://issues.apache.org/jira/browse/HADOOP-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved HADOOP-15559.
---------------------------------------
    Resolution: Fixed

> Clarity on Spark compatibility with hadoop-aws
> ----------------------------------------------
>
>                 Key: HADOOP-15559
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15559
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: documentation, fs/s3
>            Reporter: Nicholas Chammas
>            Priority: Minor
>
> I'm the maintainer of [Flintrock|https://github.com/nchammas/flintrock], a 
> command-line tool for launching Apache Spark clusters on AWS. One of the 
> things I try to do for my users is to make it straightforward to use Spark 
> with {{s3a://}}. I do this by recommending that users start Spark with the 
> {{hadoop-aws}} package.
> For example:
> {code:java}
> pyspark --packages "org.apache.hadoop:hadoop-aws:2.8.4"
> {code}
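> Once that shell is up, reading from {{s3a://}} looks roughly like this (a 
> minimal sketch; the bucket and key are placeholders, and I'm assuming AWS 
> credentials are already configured in the environment):
> {code:java}
> # Sketch only: "my-bucket" and the key are hypothetical; assumes credentials
> # come from the environment or an instance profile.
> df = spark.read.text("s3a://my-bucket/some/file.txt")
> df.show(5)
> {code}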
> I'm struggling, however, to understand what versions of {{hadoop-aws}} should 
> work with what versions of Spark.
> Spark releases are [built against Hadoop 
> 2.7|http://archive.apache.org/dist/spark/spark-2.3.1/]. At the same time, 
> I've been told that I should be able to use newer Hadoop versions and 
> libraries with Spark: for example, Spark built against Hadoop 2.7 should run 
> fine alongside HDFS 2.8, and there is [no need to build Spark explicitly 
> against Hadoop 
> 2.8|http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-3-1-RC4-tp24087p24092.html].
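> For reference, one way to check which Hadoop version a given Spark build 
> actually ships with is to ask it from PySpark (a sketch using the py4j 
> gateway; nothing Flintrock-specific):
> {code:java}
> # Sketch: print the Hadoop version bundled with this Spark build, so that
> # hadoop-aws can be matched to it. Uses PySpark's py4j gateway (_jvm).
> hadoop_version = spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
> print(hadoop_version)  # e.g. something like "2.7.3" on a Hadoop 2.7 build
> {code}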
> I'm having trouble translating this mental model into recommendations for how 
> to pair Spark with {{hadoop-aws}}.
> For example, Spark 2.3.1 built against Hadoop 2.7 works with 
> {{hadoop-aws:2.7.6}} but not with {{hadoop-aws:2.8.4}}. Trying the latter 
> yields the following error when I try to access files via {{s3a://}}:
> {code:java}
> py4j.protocol.Py4JJavaError: An error occurred while calling o35.text.
> : java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
>     at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
>     at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
>     at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
>     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
>     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>     at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
>     at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
>     at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>     at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:693)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>     at py4j.Gateway.invoke(Gateway.java:282)
>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>     at py4j.GatewayConnection.run(GatewayConnection.java:238)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
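> By contrast, the same launch command with the 2.7-line artifact works for me:
> {code:java}
> pyspark --packages "org.apache.hadoop:hadoop-aws:2.7.6"
> {code}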
> So it would seem that {{hadoop-aws}} must be matched to the same MAJOR.MINOR 
> release of Hadoop that Spark is built against. However, neither [this 
> page|https://wiki.apache.org/hadoop/AmazonS3] nor [this 
> one|https://hortonworks.github.io/hdp-aws/s3-spark/] sheds any light on how to 
> pair the correct version of {{hadoop-aws}} with Spark.
> Would it be appropriate to add some guidance somewhere on what versions of 
> {{hadoop-aws}} work with what versions and builds of Spark? It would help 
> eliminate this kind of guesswork and slow spelunking.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org
