Nicholas Chammas created HADOOP-15559:
-----------------------------------------

             Summary: Clarity on Spark compatibility with hadoop-aws
                 Key: HADOOP-15559
                 URL: https://issues.apache.org/jira/browse/HADOOP-15559
             Project: Hadoop Common
          Issue Type: Improvement
          Components: documentation, fs/s3
            Reporter: Nicholas Chammas


I'm the maintainer of [Flintrock|https://github.com/nchammas/flintrock], a 
command-line tool for launching Apache Spark clusters on AWS. One of the things 
I try to do for my users is make it straightforward to use Spark with 
{{s3a://}}. I do this by recommending that users start Spark with the 
{{hadoop-aws}} package.

For example:
{code:java}
pyspark --packages "org.apache.hadoop:hadoop-aws:2.8.4"
{code}
I'm struggling, however, to understand what versions of {{hadoop-aws}} should 
work with what versions of Spark.

Spark releases are [built against Hadoop 2.7|http://archive.apache.org/dist/spark/spark-2.3.1/]. At the same time, I've been told that I should be able to use newer versions of Hadoop and Hadoop libraries with Spark, so, for example, running Spark built against Hadoop 2.7 alongside HDFS 2.8 should work, and there is [no need to build Spark explicitly against Hadoop 2.8|http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-3-1-RC4-tp24087p24092.html].
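
One way to check which Hadoop version a given Spark build actually bundles is to ask the JVM from inside pyspark. A minimal sketch (this goes through py4j's internal {{sc._jvm}} gateway, which is not a public API):
{code:java}
# Minimal sketch: print the Hadoop version bundled with this Spark build.
# Run inside a pyspark shell, where sc is the SparkContext the shell provides;
# sc._jvm is py4j's internal gateway into the JVM, not a public API.
print(sc._jvm.org.apache.hadoop.util.VersionInfo.getVersion())
# e.g. '2.7.3' for the stock Spark 2.3.1 "Hadoop 2.7" binary release
{code}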

I'm having trouble translating this mental model into recommendations for how 
to pair Spark with {{hadoop-aws}}.

For example, Spark 2.3.1 built against Hadoop 2.7 works with {{hadoop-aws:2.7.6}} but not with {{hadoop-aws:2.8.4}}. The latter yields the following error as soon as I try to access files via {{s3a://}}:
{code:java}
py4j.protocol.Py4JJavaError: An error occurred while calling o35.text.
: java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
    at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
    at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
    at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
    at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:693)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748){code}
So it would seem that {{hadoop-aws}} must be matched to the same MAJOR.MINOR release of Hadoop that Spark is built against. However, neither [this page|https://wiki.apache.org/hadoop/AmazonS3] nor [this one|https://hortonworks.github.io/hdp-aws/s3-spark/] sheds any light on how to pair the correct version of {{hadoop-aws}} with Spark.
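
In the meantime, the rule of thumb that works for me, and which I'd only tentatively generalize, is to match {{hadoop-aws}} to the MAJOR.MINOR of the Hadoop that Spark bundles. For the stock Spark 2.3.1 build against Hadoop 2.7, that looks like this (the bucket and file names below are placeholders, and valid AWS credentials are assumed):
{code:java}
pyspark --packages "org.apache.hadoop:hadoop-aws:2.7.6"
{code}
{code:java}
# Then, inside the shell: read a text file over s3a://.
# Assumes AWS credentials are available (e.g. via the standard AWS environment
# variables) and that the bucket and key below, which are placeholders, exist.
spark.read.text("s3a://my-bucket/some-file.txt").show()
{code}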

Would it be appropriate to add some guidance somewhere on what versions of 
{{hadoop-aws}} work with what versions and builds of Spark? It would help 
eliminate this kind of guesswork and slow spelunking.


