[
https://issues.apache.org/jira/browse/SPARK-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013142#comment-14013142
]
Tathagata Das edited comment on SPARK-1911 at 5/30/14 12:31 AM:
As far as I can tell, this is because Java 7 uses the Zip64 format when creating JARs
with more than 2^16 files, and Python (at least 2.x) is not able to read Zip64. So it
fails whenever the Spark assembly JAR has more than 65k files, which in turn depends
on whether it was built with YARN and/or Hive enabled. Java 6 uses the traditional
ZIP format to create JARs, even when they contain more than 65k files, so Python
always seems to work with Java 6 JARs.
Caveat: I can't claim 100% certainty on this interpretation because there is very
little documentation about it online.
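The entry count is straightforward to check from Python itself: the classic ZIP format stores the number of entries in a 16-bit field, so any archive with more than 65,535 entries must use the Zip64 extension. A minimal sketch using the standard `zipfile` module (the JAR path in the usage comment is illustrative, not a real Spark path):

```python
import zipfile

# The classic ZIP central directory stores the entry count in a
# 16-bit field, so this is the largest count it can represent.
CLASSIC_ZIP_MAX_ENTRIES = 65535

def needs_zip64(jar_path):
    """Return (entry_count, exceeds_classic_limit) for a JAR/ZIP file.

    If exceeds_classic_limit is True, a Java 7 jar tool would have
    emitted the archive with the Zip64 extension.
    """
    with zipfile.ZipFile(jar_path) as zf:
        count = len(zf.infolist())
    return count, count > CLASSIC_ZIP_MAX_ENTRIES

# Example (path is hypothetical):
#   count, zip64 = needs_zip64("spark-assembly.jar")
```

Running this against an assembly built with YARN and/or Hive enabled would show whether it crossed the threshold on a given build.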
Warn users that jars should be built with Java 6 for PySpark to work on YARN
Key: SPARK-1911
URL: https://issues.apache.org/jira/browse/SPARK-1911
Project: Spark
Issue Type: Sub-task
Components: Documentation
Reporter: Andrew Or
Fix For: 1.0.0
Python sometimes fails to read JARs created by Java 7. PySpark on YARN needs to
read the Spark assembly JAR, so the assembly should be compiled with Java 6 for
PySpark to work on YARN.
Currently we warn users only in make-distribution.sh, but most users build the
JARs directly. We should emphasize this in the docs, especially for PySpark and
YARN, because the issue is not trivial to debug.
--
This message was sent by Atlassian JIRA
(v6.2#6252)