James Porritt created SPARK-24559:
-------------------------------------

             Summary: Some zip files passed with spark-submit --archives 
causing "invalid CEN header" error
                 Key: SPARK-24559
                 URL: https://issues.apache.org/jira/browse/SPARK-24559
             Project: Spark
          Issue Type: Bug
          Components: Spark Submit
    Affects Versions: 2.2.0
            Reporter: James Porritt


I'm encountering an error when passing zip files to spark-submit via 
--archives if they are over 2 GB and have the zip64 flag set.

{{PYSPARK_PYTHON=./ROOT/myspark/bin/python /usr/hdp/current/spark2-client/bin/spark-submit \}}
{{ --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./ROOT/myspark/bin/python \}}
{{ --master=yarn \}}
{{ --deploy-mode=cluster \}}
{{ --driver-memory=4g \}}
{{ --archives=myspark.zip#ROOT \}}
{{ --num-executors=32 \}}
{{ --packages com.databricks:spark-avro_2.11:4.0.0 \}}
{{ foo.py}}

(As a bit of background, I'm trying to prepare files using the trick of zipping 
a conda environment and passing the zip file via --archives, as per: 
https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html)

myspark.zip is a zipped conda environment. It was created using Python with the 
zipfile package. The files are stored without deflation and with the zip64 flag 
set. foo.py is my application code. This normally works, but if myspark.zip is 
larger than 2 GB and has the zip64 flag set I get:

java.util.zip.ZipException: invalid CEN header (bad signature)
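
For reference, a minimal sketch of how an archive like the one described above can be produced with Python's zipfile module. The function name zip_environment and the layout (entries prefixed with the environment directory name, so that the archive unpacks to myspark/bin/python under the #ROOT alias) are assumptions for illustration, not the exact script used to build myspark.zip.

```python
import os
import zipfile


def zip_environment(env_dir, zip_path):
    """Zip env_dir without compression, with zip64 extensions allowed.

    Hypothetical helper: entry names are made relative to the parent of
    env_dir, so every entry starts with the environment's directory name.
    """
    env_dir = os.path.abspath(env_dir)
    parent = os.path.dirname(env_dir)
    # ZIP_STORED means "no deflation"; allowZip64=True lets zipfile emit
    # zip64 records once the archive outgrows the classic format limits.
    with zipfile.ZipFile(zip_path, "w",
                         compression=zipfile.ZIP_STORED,
                         allowZip64=True) as zf:
        for root, _dirs, files in os.walk(env_dir):
            for name in sorted(files):
                full_path = os.path.join(root, name)
                zf.write(full_path, os.path.relpath(full_path, parent))
```

A call such as zip_environment("/path/to/envs/myspark", "myspark.zip") (the path is a placeholder) would then give the file passed to --archives above.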

There seems to be much written on the subject. Using the java.util.zip 
library, I was able to write Java code that reproduces this error for one of 
the problematic zip files, and other code that reads the same file without it.
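
To check whether a given archive actually carries zip64 end records, the tail of the file can be inspected directly. The sketch below is a diagnostic aid only; the function name describe_end_records is hypothetical, and the assumption that the zip64 records are what triggers the exception has not been verified here. It relies on the ZIP format's record signatures: the end of central directory record starts with PK\x05\x06, and a zip64 archive places a 20-byte locator starting with PK\x06\x07 immediately before it.

```python
import struct

EOCD_SIG = b"PK\x05\x06"           # end of central directory record
ZIP64_LOCATOR_SIG = b"PK\x06\x07"  # zip64 end of central directory locator
EOCD_LEN = 22                      # fixed part of the EOCD record
LOCATOR_LEN = 20                   # size of the zip64 locator
MAX_COMMENT = 0xFFFF               # longest possible archive comment


def describe_end_records(path):
    """Report the EOCD fields and whether a zip64 locator precedes it."""
    with open(path, "rb") as f:
        f.seek(0, 2)
        size = f.tell()
        tail_len = min(size, EOCD_LEN + MAX_COMMENT + LOCATOR_LEN)
        f.seek(size - tail_len)
        tail = f.read(tail_len)

    # Simplification: take the last occurrence of the signature. A comment
    # that happens to contain these bytes would confuse this search.
    eocd = tail.rfind(EOCD_SIG)
    if eocd < 0 or eocd + EOCD_LEN > len(tail):
        raise ValueError("no end of central directory record found")

    (_sig, _disk, _cd_disk, _entries_on_disk, entries,
     cd_size, cd_offset, _comment_len) = struct.unpack(
        "<4s4H2LH", tail[eocd:eocd + EOCD_LEN])

    locator_at = eocd - LOCATOR_LEN
    has_locator = (locator_at >= 0 and
                   tail[locator_at:locator_at + 4] == ZIP64_LOCATOR_SIG)
    return {
        "has_zip64_locator": has_locator,
        "entries": entries,      # 0xFFFF means "see the zip64 record"
        "cd_size": cd_size,      # 0xFFFFFFFF means "see the zip64 record"
        "cd_offset": cd_offset,  # 0xFFFFFFFF means "see the zip64 record"
    }
```

Running describe_end_records("myspark.zip") on a working archive and on a failing one shows whether the two differ in their end records, which helps narrow down which files a particular reader rejects.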

Spark compile info:

{{Welcome to}}
{{      ____              __}}
{{     / __/__  ___ _____/ /__}}
{{    _\ \/ _ \/ _ `/ __/  '_/}}
{{   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.2.6.4.0-91}}
{{      /_/}}

{{Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_112}}
{{Branch HEAD}}
{{Compiled by user jenkins on 2018-01-04T10:41:05Z}}
{{Revision a24017869f5450397136ee8b11be818e7cd3facb}}
{{Url g...@github.com:hortonworks/spark2.git}}
{{Type --help for more information.}}

Console output (YARN client logs) from the above command. I've tried both 
--deploy-mode=cluster and --deploy-mode=client.

{{18/06/13 16:00:22 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable}}
{{18/06/13 16:00:23 WARN DomainSocketFactory: The short-circuit local reads 
feature cannot be used because libhadoop cannot be loaded.}}
{{18/06/13 16:00:23 INFO RMProxy: Connecting to ResourceManager at 
myhost2.myfirm.com/10.87.11.17:8050}}
{{18/06/13 16:00:23 INFO Client: Requesting a new application from cluster with 
6 NodeManagers}}
{{18/06/13 16:00:23 INFO Client: Verifying our application has not requested 
more than the maximum memory capability of the cluster (221184 MB per 
container)}}
{{18/06/13 16:00:23 INFO Client: Will allocate AM container, with 18022 MB 
memory including 1638 MB overhead}}
{{18/06/13 16:00:23 INFO Client: Setting up container launch context for our 
AM}}
{{18/06/13 16:00:23 INFO Client: Setting up the launch environment for our AM 
container}}
{{18/06/13 16:00:23 INFO Client: Preparing resources for our AM container}}
{{18/06/13 16:00:24 INFO Client: Use hdfs cache file as spark.yarn.archive for 
HDP, 
hdfsCacheFile:hdfs://myhost.myfirm.com:8020/hdp/apps/2.6.4.0-91/spark2/spark2-hdp-yarn-archive.tar.gz}}
{{18/06/13 16:00:24 INFO Client: Source and destination file systems are the 
same. Not copying 
hdfs://myhost.myfirm.com:8020/hdp/apps/2.6.4.0-91/spark2/spark2-hdp-yarn-archive.tar.gz}}
{{18/06/13 16:00:24 INFO Client: Uploading resource 
file:/home/myuser/.ivy2/jars/com.databricks_spark-avro_2.11-4.0.0.jar -> 
hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/com.databricks_spark-avro_2.11-4.0.0.jar}}
{{18/06/13 16:00:26 INFO Client: Uploading resource 
file:/home/myuser/.ivy2/jars/org.slf4j_slf4j-api-1.7.5.jar -> 
hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/org.slf4j_slf4j-api-1.7.5.jar}}
{{18/06/13 16:00:26 INFO Client: Uploading resource 
file:/home/myuser/.ivy2/jars/org.apache.avro_avro-1.7.6.jar -> 
hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/org.apache.avro_avro-1.7.6.jar}}
{{18/06/13 16:00:26 INFO Client: Uploading resource 
file:/home/myuser/.ivy2/jars/org.codehaus.jackson_jackson-core-asl-1.9.13.jar 
-> 
hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/org.codehaus.jackson_jackson-core-asl-1.9.13.jar}}
{{18/06/13 16:00:26 INFO Client: Uploading resource 
file:/home/myuser/.ivy2/jars/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar 
-> 
hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar}}
{{18/06/13 16:00:26 INFO Client: Uploading resource 
file:/home/myuser/.ivy2/jars/com.thoughtworks.paranamer_paranamer-2.3.jar -> 
hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/com.thoughtworks.paranamer_paranamer-2.3.jar}}
{{18/06/13 16:00:26 INFO Client: Uploading resource 
file:/home/myuser/.ivy2/jars/org.xerial.snappy_snappy-java-1.0.5.jar -> 
hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/org.xerial.snappy_snappy-java-1.0.5.jar}}
{{18/06/13 16:00:26 INFO Client: Uploading resource 
file:/home/myuser/.ivy2/jars/org.apache.commons_commons-compress-1.4.1.jar -> 
hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/org.apache.commons_commons-compress-1.4.1.jar}}
{{18/06/13 16:00:26 INFO Client: Uploading resource 
file:/home/myuser/.ivy2/jars/org.tukaani_xz-1.0.jar -> 
hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/org.tukaani_xz-1.0.jar}}
{{18/06/13 16:00:26 INFO Client: Source and destination file systems are the 
same. Not copying hdfs:/user/myuser/release/alphagenspark.zip#ROOT}}
{{18/06/13 16:00:26 INFO Client: Uploading resource 
file:/my/script/dir/spark/alphagen/foo.py -> 
hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/foo.py}}
{{18/06/13 16:00:26 INFO Client: Uploading resource 
file:/usr/hdp/current/spark2-client/python/lib/pyspark.zip -> 
hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/pyspark.zip}}
{{18/06/13 16:00:26 INFO Client: Uploading resource 
file:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip -> 
hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/py4j-0.10.4-src.zip}}
{{18/06/13 16:00:26 WARN Client: Same path resource 
file:/home/myuser/.ivy2/jars/com.databricks_spark-avro_2.11-4.0.0.jar added 
multiple times to distributed cache.}}
{{18/06/13 16:00:26 WARN Client: Same path resource 
file:/home/myuser/.ivy2/jars/org.slf4j_slf4j-api-1.7.5.jar added multiple times 
to distributed cache.}}
{{18/06/13 16:00:26 WARN Client: Same path resource 
file:/home/myuser/.ivy2/jars/org.apache.avro_avro-1.7.6.jar added multiple 
times to distributed cache.}}
{{18/06/13 16:00:26 WARN Client: Same path resource 
file:/home/myuser/.ivy2/jars/org.codehaus.jackson_jackson-core-asl-1.9.13.jar 
added multiple times to distributed cache.}}
{{18/06/13 16:00:26 WARN Client: Same path resource 
file:/home/myuser/.ivy2/jars/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar 
added multiple times to distributed cache.}}
{{18/06/13 16:00:26 WARN Client: Same path resource 
file:/home/myuser/.ivy2/jars/com.thoughtworks.paranamer_paranamer-2.3.jar added 
multiple times to distributed cache.}}
{{18/06/13 16:00:26 WARN Client: Same path resource 
file:/home/myuser/.ivy2/jars/org.xerial.snappy_snappy-java-1.0.5.jar added 
multiple times to distributed cache.}}
{{18/06/13 16:00:26 WARN Client: Same path resource 
file:/home/myuser/.ivy2/jars/org.apache.commons_commons-compress-1.4.1.jar 
added multiple times to distributed cache.}}
{{18/06/13 16:00:26 WARN Client: Same path resource 
file:/home/myuser/.ivy2/jars/org.tukaani_xz-1.0.jar added multiple times to 
distributed cache.}}
{{18/06/13 16:00:27 INFO Client: Uploading resource 
file:/tmp/spark-6c26ae3b-7248-488f-bc33-9766251474bb/__spark_conf__4405623606341803690.zip
 -> 
hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/__spark_conf__.zip}}
{{18/06/13 16:00:27 INFO SecurityManager: Changing view acls to: myuser}}
{{18/06/13 16:00:27 INFO SecurityManager: Changing modify acls to: myuser}}
{{18/06/13 16:00:27 INFO SecurityManager: Changing view acls groups to:}}
{{18/06/13 16:00:27 INFO SecurityManager: Changing modify acls groups to:}}
{{18/06/13 16:00:27 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(myuser); groups 
with view permissions: Set(); users with modify permissions: Set(myuser); 
groups with modify permissions: Set()}}
{{18/06/13 16:00:27 INFO Client: Submitting application 
application_1528901858967_0019 to ResourceManager}}
{{18/06/13 16:00:27 INFO YarnClientImpl: Submitted application 
application_1528901858967_0019}}
{{18/06/13 16:00:28 INFO Client: Application report for 
application_1528901858967_0019 (state: ACCEPTED)}}
{{18/06/13 16:00:28 INFO Client:}}
{{ client token: N/A}}
{{ diagnostics: AM container is launched, waiting for AM container to Register 
with RM}}
{{ ApplicationMaster host: N/A}}
{{ ApplicationMaster RPC port: -1}}
{{ queue: default}}
{{ start time: 1528923627110}}
{{ final status: UNDEFINED}}
{{ tracking URL: 
http://myhost2.myfirm.com:8088/proxy/application_1528901858967_0019/}}
{{ user: myuser}}
{{18/06/13 16:00:29 INFO Client: Application report for 
application_1528901858967_0019 (state: ACCEPTED)}}
{{18/06/13 16:00:30 INFO Client: Application report for 
application_1528901858967_0019 (state: ACCEPTED)}}
{{18/06/13 16:00:31 INFO Client: Application report for 
application_1528901858967_0019 (state: ACCEPTED)}}
{{18/06/13 16:00:32 INFO Client: Application report for 
application_1528901858967_0019 (state: ACCEPTED)}}
{{18/06/13 16:00:33 INFO Client: Application report for 
application_1528901858967_0019 (state: ACCEPTED)}}
{{18/06/13 16:00:34 INFO Client: Application report for 
application_1528901858967_0019 (state: ACCEPTED)}}
{{18/06/13 16:00:35 INFO Client: Application report for 
application_1528901858967_0019 (state: ACCEPTED)}}
{{18/06/13 16:00:36 INFO Client: Application report for 
application_1528901858967_0019 (state: ACCEPTED)}}
{{18/06/13 16:00:37 INFO Client: Application report for 
application_1528901858967_0019 (state: ACCEPTED)}}
{{18/06/13 16:00:38 INFO Client: Application report for 
application_1528901858967_0019 (state: ACCEPTED)}}
{{18/06/13 16:00:39 INFO Client: Application report for 
application_1528901858967_0019 (state: FAILED)}}
{{18/06/13 16:00:39 INFO Client:}}
{{ client token: N/A}}
{{ diagnostics: Application application_1528901858967_0019 failed 2 times due 
to AM Container for appattempt_1528901858967_0019_000002 exited with exitCode: 
-1000}}
{{For more detailed output, check the application tracking page: 
http://myhost2.myfirm.com:8088/cluster/app/application_1528901858967_0019 Then 
click on links to logs of each attempt.}}
{{Diagnostics: java.util.zip.ZipException: invalid CEN header (bad signature)}}
{{Failing this attempt. Failing the application.}}
{{ ApplicationMaster host: N/A}}
{{ ApplicationMaster RPC port: -1}}
{{ queue: default}}
{{ start time: 1528923627110}}
{{ final status: FAILED}}
{{ tracking URL: 
http://myhost2.myfirm.com:8088/cluster/app/application_1528901858967_0019}}
{{ user: myuser}}
{{18/06/13 16:00:39 INFO Client: Deleted staging directory 
hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019}}
{{Exception in thread "main" org.apache.spark.SparkException: Application 
application_1528901858967_0019 finished with failed status}}
{{ at org.apache.spark.deploy.yarn.Client.run(Client.scala:1187)}}
{{ at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1233)}}
{{ at org.apache.spark.deploy.yarn.Client.main(Client.scala)}}
{{ at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)}}
{{ at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)}}
{{ at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)}}
{{ at java.lang.reflect.Method.invoke(Method.java:498)}}
{{ at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:782)}}
{{ at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)}}
{{ at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)}}
{{ at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)}}
{{ at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)}}
{{18/06/13 16:00:39 INFO ShutdownHookManager: Shutdown hook called}}
{{18/06/13 16:00:39 INFO ShutdownHookManager: Deleting directory 
/tmp/spark-6c26ae3b-7248-488f-bc33-9766251474bb}}


