[jira] [Updated] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used

2017-11-29 Thread Stavros Kontopoulos (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stavros Kontopoulos updated SPARK-22657:

Description: 
To reproduce this issue run:

./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl s3n://arand-sandbox-mesosphere/linecount.out

within a container created from the mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 image.

You will get: "Exception in thread "main" java.io.IOException: No FileSystem for scheme: s3n"
This can also be reproduced with local[*]; there is no need to use Mesos, and this is not a Mesos bug.

The specific Spark job used above can be found here:
https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
It can be built with sbt assembly in that directory.

Using this code at the beginning of the main method:
https://gist.github.com/skonto/4f5ff1e5ede864f90b323cc20bf1e1cb
you get the following output:
https://gist.github.com/skonto/d22b8431586b6663ddd720e179030da4
(Use http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar to get the modified job.)
The job works fine if --packages is not used.

The commit that introduced this issue (before it, things work as expected) is:
5800144a54f5c0180ccf67392f32c3e8a51119b1 - [SPARK-21012][SUBMIT] Add glob support for resources adding to Spark (Thu, 6 Jul 2017 15:32:49 +0800)

The exception comes from here:
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311

In https://github.com/apache/spark/pull/18235/files, check line 950; this is where a FileSystem is first created.

The FileSystem class is initialized there, before the main method of the Spark job is launched; the reason is that the --packages logic uses Hadoop libraries to download files.

Maven resolution happens before the app jar and the resolved jars are added to the classpath. So at that moment there is no s3n implementation to add to the static map (SERVICE_FILE_SYSTEMS) when the FileSystem static members are first initialized and then filled as a result of the first FileSystem instance being created.
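
(Not part of the original report, just an illustration of the mechanism above.) In Hadoop 2.x, loadFileSystems() fills that static map by scanning META-INF/services/org.apache.hadoop.fs.FileSystem entries with java.util.ServiceLoader, and it does so exactly once. A minimal Scala sketch that runs the same kind of scan, so you can see which schemes are discoverable on the current classpath at a given moment (the object name is made up; it assumes hadoop-common is on the classpath):

```scala
import java.util.ServiceLoader

import org.apache.hadoop.fs.FileSystem

import scala.collection.JavaConverters._
import scala.util.Try

object ListDiscoverableFileSystems {
  def main(args: Array[String]): Unit = {
    // Roughly what Hadoop's one-time loadFileSystems() does: every provider listed
    // under META-INF/services/org.apache.hadoop.fs.FileSystem on the *current*
    // classpath is instantiated and asked for its scheme.
    val providers = ServiceLoader.load(classOf[FileSystem]).asScala
    providers.foreach { fs =>
      val scheme = Try(fs.getScheme).getOrElse("<no scheme>")
      println(s"$scheme -> ${fs.getClass.getName}")
    }
    // A jar that only lands on the classpath after Hadoop's own scan has run
    // (e.g. the app jar carrying the s3n implementation) never makes it into
    // SERVICE_FILE_SYSTEMS, because the scan is guarded and runs exactly once.
  }
}
```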

Later, in the Spark job's main method, where we try to access the s3n filesystem (creating a second FileSystem), we get the exception: at this point the app jar contains the s3n implementation and it is on the classpath, but that scheme was never loaded into the static map of the FileSystem class.
hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect, since the problem is with the static map, which is filled once and only once.
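
(A side note, not from the report.) Since Hadoop's getFileSystemClass consults fs.<scheme>.impl from the Configuration before falling back to the service map, explicitly naming the implementation class is one way to sidestep the frozen map from inside the job's main. This is only a hedged sketch; NativeS3FileSystem is the usual s3n implementation in Hadoop 2.x, adjust if your assembly ships a different one:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

object S3nImplWorkaroundSketch {
  def main(args: Array[String]): Unit = {
    val hadoopConf = new Configuration()
    // Point fs.s3n.impl at the implementation class directly, so the lookup no
    // longer depends on the already-frozen SERVICE_FILE_SYSTEMS service map.
    hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")

    // This is the same resolution step that otherwise fails with
    // "No FileSystem for scheme: s3n"; actually opening the filesystem would
    // additionally need the usual fs.s3n.* credential settings.
    val cls = FileSystem.getFileSystemClass("s3n", hadoopConf)
    println(s"s3n resolves to ${cls.getName}")
  }
}
```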
That's why we see two prints of the map contents in the output (gist) above when --packages is used. The first print happens before creating the s3n filesystem; we use reflection there to get the static map's entries. When --packages is not used, that map is empty before creating the s3n filesystem, since up to that point the FileSystem class has not yet been loaded by the classloader.
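
(The gists themselves are not reproduced here.) A reflective dump along the following lines is enough to see the difference between the two runs; it is only a sketch and assumes the Hadoop 2.x private field name SERVICE_FILE_SYSTEMS:

```scala
import org.apache.hadoop.fs.FileSystem

import scala.collection.JavaConverters._

object DumpServiceFileSystems {
  def main(args: Array[String]): Unit = {
    // SERVICE_FILE_SYSTEMS is a private static Map<String, Class<? extends FileSystem>>
    // on org.apache.hadoop.fs.FileSystem in Hadoop 2.x, so reading it needs reflection.
    val field = classOf[FileSystem].getDeclaredField("SERVICE_FILE_SYSTEMS")
    field.setAccessible(true)
    val entries = field.get(null).asInstanceOf[java.util.Map[String, Class[_]]].asScala
    if (entries.isEmpty)
      println("SERVICE_FILE_SYSTEMS is empty (no service scan has run yet)")
    else
      entries.foreach { case (scheme, cls) => println(s"$scheme -> ${cls.getName}") }
  }
}
```

With --packages the first dump already shows the service-loaded schemes (without s3n); without --packages the map is still empty at that point.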
