[jira] [Created] (SPARK-24559) Some zip files passed with spark-submit --archives causing "invalid CEN header" error
James Porritt created SPARK-24559:
-
Summary: Some zip files passed with spark-submit --archives causing "invalid CEN header" error
Key: SPARK-24559
URL: https://issues.apache.org/jira/browse/SPARK-24559
Project: Spark
Issue Type: Bug
Components: Spark Submit
Affects Versions: 2.2.0
Reporter: James Porritt

I'm encountering an error when passing zip files to spark-submit via --archives, when the files are over 2 GB and have the zip64 flag set:

{code}
PYSPARK_PYTHON=./ROOT/myspark/bin/python /usr/hdp/current/spark2-client/bin/spark-submit \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./ROOT/myspark/bin/python \
    --master=yarn \
    --deploy-mode=cluster \
    --driver-memory=4g \
    --archives=myspark.zip#ROOT \
    --num-executors=32 \
    --packages com.databricks:spark-avro_2.11:4.0.0 \
    foo.py
{code}

(As a bit of background, I'm preparing files using the trick of zipping a conda environment and passing the zip file via --archives, as per: https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html)

myspark.zip is a zipped conda environment. It was created in Python with the zipfile package; the files are stored without deflation and with the zip64 flag set. foo.py is my application code. This normally works, but if myspark.zip is greater than 2 GB and has the zip64 flag set I get:

{noformat}
java.util.zip.ZipException: invalid CEN header (bad signature)
{noformat}

There seems to be much written on the subject, and I was able to write Java code using the java.util.zip library that both does and doesn't encounter this error for one of the problematic zip files.
Spark compile info:

{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.2.6.4.0-91
      /_/

Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_112
Branch HEAD
Compiled by user jenkins on 2018-01-04T10:41:05Z
Revision a24017869f5450397136ee8b11be818e7cd3facb
Url g...@github.com:hortonworks/spark2.git
Type --help for more information.
{noformat}

YARN logs on console after the above command. I've tried both --deploy-mode=cluster and --deploy-mode=client.

{noformat}
18/06/13 16:00:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/06/13 16:00:23 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
18/06/13 16:00:23 INFO RMProxy: Connecting to ResourceManager at myhost2.myfirm.com/10.87.11.17:8050
18/06/13 16:00:23 INFO Client: Requesting a new application from cluster with 6 NodeManagers
18/06/13 16:00:23 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (221184 MB per container)
18/06/13 16:00:23 INFO Client: Will allocate AM container, with 18022 MB memory including 1638 MB overhead
18/06/13 16:00:23 INFO Client: Setting up container launch context for our AM
18/06/13 16:00:23 INFO Client: Setting up the launch environment for our AM container
18/06/13 16:00:23 INFO Client: Preparing resources for our AM container
18/06/13 16:00:24 INFO Client: Use hdfs cache file as spark.yarn.archive for HDP, hdfsCacheFile:hdfs://myhost.myfirm.com:8020/hdp/apps/2.6.4.0-91/spark2/spark2-hdp-yarn-archive.tar.gz
18/06/13 16:00:24 INFO Client: Source and destination file systems are the same. Not copying hdfs://myhost.myfirm.com:8020/hdp/apps/2.6.4.0-91/spark2/spark2-hdp-yarn-archive.tar.gz
18/06/13 16:00:24 INFO Client: Uploading resource file:/home/myuser/.ivy2/jars/com.databricks_spark-avro_2.11-4.0.0.jar -> hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/com.databricks_spark-avro_2.11-4.0.0.jar
18/06/13 16:00:26 INFO Client: Uploading resource file:/home/myuser/.ivy2/jars/org.slf4j_slf4j-api-1.7.5.jar -> hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/org.slf4j_slf4j-api-1.7.5.jar
18/06/13 16:00:26 INFO Client: Uploading resource file:/home/myuser/.ivy2/jars/org.apache.avro_avro-1.7.6.jar -> hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/org.apache.avro_avro-1.7.6.jar
18/06/13 16:00:26 INFO Client: Uploading resource file:/home/myuser/.ivy2/jars/org.codehaus.jackson_jackson-core-asl-1.9.13.jar -> hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/org.codehaus.jackson_jackson-core-asl-1.9.13.jar
18/06/13 16:00:26 INFO Client: Uploading resource file:/home/myuser/.ivy2/jars/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar -> hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/o
{noformat}
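For reference, the packaging step described above can be sketched with Python's zipfile module. This is a minimal illustration under stated assumptions (a hypothetical environment directory and output name, not the reporter's exact script): files are stored without deflation (ZIP_STORED), and allowZip64=True lets the archive grow past the classic 32-bit ZIP limits, at which point a zip64 end-of-central-directory record (signature PK\x06\x06) is written.

```python
import os
import zipfile

def zip_env(src_dir: str, out_zip: str) -> None:
    """Zip a directory tree: stored without deflation, zip64 permitted."""
    with zipfile.ZipFile(out_zip, "w",
                         compression=zipfile.ZIP_STORED,
                         allowZip64=True) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                zf.write(path, os.path.relpath(path, src_dir))

def uses_zip64(path: str) -> bool:
    """Scan the archive tail for the zip64 end-of-central-directory
    signature (PK\\x06\\x06). The EOCD records sit at the end of the file."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - 65536))
        return b"PK\x06\x06" in f.read()
```

Note that zipfile only emits zip64 structures when the archive actually exceeds the classic limits, so a small test archive reports False from uses_zip64 even with allowZip64=True; an archive over the size threshold, like the reporter's, should report True.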
[jira] [Updated] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly
[ https://issues.apache.org/jira/browse/SPARK-22468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

James Porritt updated SPARK-22468:
--
Description:

I have an issue whereby a subtract between two DataFrames that should correctly end up with an empty DataFrame seemingly leaves the DataFrame not initialised properly. In my code I do the subtract both ways:

{code}
x = a.subtract(b)
y = b.subtract(a)
{code}

I then call .rdd.isEmpty() on both of them to check whether I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce, however; I can't seem to reduce it to a sample.

One of the errors I get is:

{noformat}
File "", line 642, in
  if not y.rdd.isEmpty():
File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty
File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take
File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob
File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd
File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function
File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__
File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args
File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert
AttributeError: 'NoneType' object has no attribute 'add'
{noformat}

Another error is:

{noformat}
File "", line 642, in
  if not y.rdd.isEmpty():
File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty
File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take
File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob
File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in _jrdd
File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o5751.asJavaRDD. Trace:
py4j.Py4JException: Method asJavaRDD([]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
	at py4j.Gateway.invoke(Gateway.java:272)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)
{noformat}

Another error is:

{noformat}
  if not y.rdd.isEmpty():
File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty
File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 385, in getNumPartitions
AttributeError: 'NoneType' object has no attribute 'size'
{noformat}

This is happening at multiple points in my code.
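To make the reported pattern concrete, here is the two-way subtract reduced to plain Python sets, a hypothetical stand-in for the DataFrame calls (since the bug has not been reduced to a runnable sample). Whichever direction yields an empty result, the emptiness check should simply return True, never raise an AttributeError from an uninitialised internal object:

```python
# Plain-set analogue of the DataFrame pattern in the report.
a = {1, 2, 3}
b = {2, 3}

x = a - b  # analogous to x = a.subtract(b): non-empty
y = b - a  # analogous to y = b.subtract(a): empty

# The equivalent of y.rdd.isEmpty() should just be True here.
assert x == {1}
assert len(y) == 0
```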
[jira] [Created] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly
James Porritt created SPARK-22468: -- Summary: subtract creating empty DataFrame that isn't initialised properly
Key: SPARK-22468
URL: https://issues.apache.org/jira/browse/SPARK-22468
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.2.0
Reporter: James Porritt

I have an issue whereby a subtract between two DataFrames that should correctly end up with an empty DataFrame seemingly leaves the DataFrame not initialised properly. In my code I try to do the subtract both ways:

{noformat}
x = a.subtract(b)
y = b.subtract(a)
{noformat}

I then call .rdd.isEmpty() on both of them to check whether I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce, however. The error I get is:

{noformat}
  File "", line 642, in
    if y.rdd.isEmpty():
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert
AttributeError: 'NoneType' object has no attribute 'add'
{noformat}

Sometimes the error will complain about it not having a 'size' parameter.

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
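For context, DataFrame.subtract follows EXCEPT DISTINCT semantics: it returns the distinct rows of the first frame that do not appear in the second, so one direction of the both-ways pattern above is often empty. A minimal plain-Python sketch of those expected semantics (a hypothetical `subtract` helper over lists of row tuples, not PySpark itself):

```python
def subtract(a, b):
    """Plain-Python analogue of DataFrame.subtract:
    distinct rows of `a` that do not appear in `b` (EXCEPT DISTINCT)."""
    b_rows = set(b)
    seen = set()
    out = []
    for row in a:
        if row not in b_rows and row not in seen:
            seen.add(row)   # keep each surviving row once (DISTINCT)
            out.append(row)
    return out

a = [(1, "x"), (2, "y"), (2, "y")]
b = [(2, "y")]
x = subtract(a, b)   # [(1, "x")] -- non-empty direction
y = subtract(b, a)   # []         -- the empty direction that triggered the bug
```

The reported failure is not in these semantics but in the emptiness check afterwards: `y.rdd.isEmpty()` forces evaluation of the (correctly empty) result, and that evaluation is where the Py4J errors surface.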
[jira] [Resolved] (SPARK-20809) PySpark: Java heap space issue despite apparently being within memory limits
[ https://issues.apache.org/jira/browse/SPARK-20809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Porritt resolved SPARK-20809. --- Resolution: Fixed

Solution was to specify --driver-memory on the command line.

> PySpark: Java heap space issue despite apparently being within memory limits
>
> Key: SPARK-20809
> URL: https://issues.apache.org/jira/browse/SPARK-20809
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.1.1
> Environment: Linux x86_64
> Reporter: James Porritt
>
> I have the following script:
> {code}
> import itertools
> import loremipsum
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession
> conf = SparkConf().set("spark.cores.max", "16") \
>     .set("spark.driver.memory", "16g") \
>     .set("spark.executor.memory", "16g") \
>     .set("spark.executor.memory_overhead", "16g") \
>     .set("spark.driver.maxResultsSize", "0")
> sc = SparkContext(appName="testRDD", conf=conf)
> ss = SparkSession(sc)
> j = itertools.cycle(range(8))
> rows = [(i, j.next(), ' '.join(map(lambda x: x[2],
>     loremipsum.generate_sentences(600)))) for i in range(500)] * 100
> rrd = sc.parallelize(rows, 128)
> {code}
> When I run it with:
> {noformat}
> /spark-2.1.1-bin-hadoop2.7/bin/spark-submit <directory>/writeTest.py
> {noformat}
> it fails with a 'Java heap space' error:
> {noformat}
> py4j.protocol.Py4JJavaError: An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.readRDDFromFile.
> : java.lang.OutOfMemoryError: Java heap space
> 	at org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:468)
> 	at org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:497)
> 	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> 	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> 	at py4j.Gateway.invoke(Gateway.java:280)
> 	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> 	at py4j.commands.CallCommand.execute(CallCommand.java:79)
> 	at py4j.GatewayConnection.run(GatewayConnection.java:214)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The data I create here approximates my actual data. The third element of each
> tuple should be around 25k, and there are 50k tuples overall. I estimate that
> I should have around 1.2G of data.
> Why then does it fail? All parts of the system should have enough memory.
[jira] [Commented] (SPARK-20809) PySpark: Java heap space issue despite apparently being within memory limits
[ https://issues.apache.org/jira/browse/SPARK-20809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16020744#comment-16020744 ] James Porritt commented on SPARK-20809: ---

Many thanks, this put me on track for the solution. I needed to pass --driver-memory=16g on the command line rather than set it in the code. I'd done some tests on the sentence generator and worked out how to get it to give me a 25K string, which multiplied by 50,000 is about 1.2G.

> PySpark: Java heap space issue despite apparently being within memory limits
>
> Key: SPARK-20809
> URL: https://issues.apache.org/jira/browse/SPARK-20809
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.1.1
> Environment: Linux x86_64
> Reporter: James Porritt
>
> I have the following script:
> {code}
> import itertools
> import loremipsum
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession
> conf = SparkConf().set("spark.cores.max", "16") \
>     .set("spark.driver.memory", "16g") \
>     .set("spark.executor.memory", "16g") \
>     .set("spark.executor.memory_overhead", "16g") \
>     .set("spark.driver.maxResultsSize", "0")
> sc = SparkContext(appName="testRDD", conf=conf)
> ss = SparkSession(sc)
> j = itertools.cycle(range(8))
> rows = [(i, j.next(), ' '.join(map(lambda x: x[2],
>     loremipsum.generate_sentences(600)))) for i in range(500)] * 100
> rrd = sc.parallelize(rows, 128)
> {code}
> When I run it with:
> {noformat}
> /spark-2.1.1-bin-hadoop2.7/bin/spark-submit <directory>/writeTest.py
> {noformat}
> it fails with a 'Java heap space' error:
> {noformat}
> py4j.protocol.Py4JJavaError: An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.readRDDFromFile.
> : java.lang.OutOfMemoryError: Java heap space
> 	at org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:468)
> 	at org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:497)
> 	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> 	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> 	at py4j.Gateway.invoke(Gateway.java:280)
> 	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> 	at py4j.commands.CallCommand.execute(CallCommand.java:79)
> 	at py4j.GatewayConnection.run(GatewayConnection.java:214)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The data I create here approximates my actual data. The third element of each
> tuple should be around 25k, and there are 50k tuples overall. I estimate that
> I should have around 1.2G of data.
> Why then does it fail? All parts of the system should have enough memory.
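The resolution reflects a general Spark rule: `spark.driver.memory` set via SparkConf inside the application takes effect too late in client mode, because the driver JVM has already launched with its default heap by the time the conf is read. The heap size has to reach spark-submit itself. A sketch of the working invocation, reusing the paths from the report:

```shell
# spark.driver.memory cannot be raised from inside the script in client
# mode: the driver JVM is already running when SparkConf is applied.
# Pass the heap size to spark-submit instead.
/spark-2.1.1-bin-hadoop2.7/bin/spark-submit \
    --driver-memory 16g \
    writeTest.py
```

Executor-side settings such as `spark.executor.memory` can still be set in code, since executors launch after the conf exists; it is specifically the driver's own heap that must be fixed before JVM startup.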
[jira] [Created] (SPARK-20809) PySpark: Java heap space issue despite apparently being within memory limits
James Porritt created SPARK-20809: -- Summary: PySpark: Java heap space issue despite apparently being within memory limits
Key: SPARK-20809
URL: https://issues.apache.org/jira/browse/SPARK-20809
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.1.1
Environment: Linux x86_64
Reporter: James Porritt

I have the following script:

{code}
import itertools
import loremipsum
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().set("spark.cores.max", "16") \
    .set("spark.driver.memory", "16g") \
    .set("spark.executor.memory", "16g") \
    .set("spark.executor.memory_overhead", "16g") \
    .set("spark.driver.maxResultsSize", "0")
sc = SparkContext(appName="testRDD", conf=conf)
ss = SparkSession(sc)
j = itertools.cycle(range(8))
rows = [(i, j.next(), ' '.join(map(lambda x: x[2],
    loremipsum.generate_sentences(600)))) for i in range(500)] * 100
rrd = sc.parallelize(rows, 128)
{code}

When I run it with:

{noformat}
/spark-2.1.1-bin-hadoop2.7/bin/spark-submit <directory>/writeTest.py
{noformat}

it fails with a 'Java heap space' error:

{noformat}
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.readRDDFromFile.
: java.lang.OutOfMemoryError: Java heap space
	at org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:468)
	at org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)
{noformat}

The data I create here approximates my actual data. The third element of each tuple should be around 25k, and there are 50k tuples overall. I estimate that I should have around 1.2G of data. Why then does it fail? All parts of the system should have enough memory.
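The reporter's size estimate checks out with quick arithmetic (assuming, as stated, ~25 KB of generated text per tuple):

```python
# Back-of-envelope check of the reporter's data-size estimate:
# 500 rows per comprehension, replicated 100 times, ~25 KB of text per tuple.
rows_total = 500 * 100            # 50,000 tuples overall
bytes_per_string = 25 * 1024      # ~25 KB third element per tuple
total_gb = rows_total * bytes_per_string / 1024**3
print(round(total_gb, 2))         # ~1.19, consistent with the ~1.2G figure
```

The raw data is indeed ~1.2 GB, but `parallelize` pickles it and the driver-side `readRDDFromFile` (where the stack trace originates) loads that serialized blob into the driver JVM, which plausibly overflows Spark's default 1g driver heap; this is consistent with the eventual resolution of passing --driver-memory on the command line.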