[jira] [Created] (SPARK-24559) Some zip files passed with spark-submit --archives causing "invalid CEN header" error

2018-06-14 Thread James Porritt (JIRA)
James Porritt created SPARK-24559:
-

 Summary: Some zip files passed with spark-submit --archives 
causing "invalid CEN header" error
 Key: SPARK-24559
 URL: https://issues.apache.org/jira/browse/SPARK-24559
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.2.0
Reporter: James Porritt


I'm encountering an error when passing zip files to spark-submit via --archives that are over 2 GB and have the zip64 flag set.

{noformat}
PYSPARK_PYTHON=./ROOT/myspark/bin/python /usr/hdp/current/spark2-client/bin/spark-submit \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./ROOT/myspark/bin/python \
  --master=yarn \
  --deploy-mode=cluster \
  --driver-memory=4g \
  --archives=myspark.zip#ROOT \
  --num-executors=32 \
  --packages com.databricks:spark-avro_2.11:4.0.0 \
  foo.py
{noformat}

(As background, I'm preparing the files using the trick of zipping a conda environment and passing the zip file via --archives, as per: 
https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html)
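
For reference, a minimal sketch of how such an archive can be built with Python's zipfile module (illustrative only: the environment path is made up, and a real conda environment would also need symlinks and executable permissions handled):

{code:python}
import os
import zipfile

# Store the environment uncompressed; allowZip64=True lets the archive
# exceed the classic 4 GB / 65535-entry limits, which is what puts the
# zip64 extensions on the file.
def zip_env(env_dir, out_path):
    with zipfile.ZipFile(out_path, "w",
                         compression=zipfile.ZIP_STORED,
                         allowZip64=True) as zf:
        for root, _dirs, files in os.walk(env_dir):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, env_dir))

zip_env("envs/myspark", "myspark.zip")
{code}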

myspark.zip is a zipped conda environment, created with Python's zipfile package. The files are stored without compression and with the zip64 flag set. foo.py is my application code. This normally works, but if myspark.zip is larger than 2 GB and has the zip64 flag set I get:

java.util.zip.ZipException: invalid CEN header (bad signature)

There seems to be a fair amount written on this subject, and using the java.util.zip library I was able to write Java code that either does or does not hit this error on one of the problematic zip files, depending on how the archive is read.
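
One way to confirm the archive itself is well formed, so the failure is isolated to the JVM-side reader, is to re-read it with Python's zipfile; this snippet is an illustrative suggestion, not something from the original report:

{code:python}
import zipfile

# If Python can parse the central directory and test every member, the
# archive is a well-formed (zip64) zip, and the "invalid CEN header"
# is specific to how the JVM reads it.
with zipfile.ZipFile("myspark.zip") as zf:
    print(len(zf.infolist()), "entries")
    print("first corrupt member:", zf.testzip())  # None if all OK
{code}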

Spark compile info:

{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.2.6.4.0-91
      /_/

Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_112
Branch HEAD
Compiled by user jenkins on 2018-01-04T10:41:05Z
Revision a24017869f5450397136ee8b11be818e7cd3facb
Url git@github.com:hortonworks/spark2.git
Type --help for more information.
{noformat}

YARN logs on the console after the above command. I've tried both --deploy-mode=cluster and --deploy-mode=client.

{noformat}
18/06/13 16:00:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/06/13 16:00:23 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
18/06/13 16:00:23 INFO RMProxy: Connecting to ResourceManager at myhost2.myfirm.com/10.87.11.17:8050
18/06/13 16:00:23 INFO Client: Requesting a new application from cluster with 6 NodeManagers
18/06/13 16:00:23 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (221184 MB per container)
18/06/13 16:00:23 INFO Client: Will allocate AM container, with 18022 MB memory including 1638 MB overhead
18/06/13 16:00:23 INFO Client: Setting up container launch context for our AM
18/06/13 16:00:23 INFO Client: Setting up the launch environment for our AM container
18/06/13 16:00:23 INFO Client: Preparing resources for our AM container
18/06/13 16:00:24 INFO Client: Use hdfs cache file as spark.yarn.archive for HDP, hdfsCacheFile:hdfs://myhost.myfirm.com:8020/hdp/apps/2.6.4.0-91/spark2/spark2-hdp-yarn-archive.tar.gz
18/06/13 16:00:24 INFO Client: Source and destination file systems are the same. Not copying hdfs://myhost.myfirm.com:8020/hdp/apps/2.6.4.0-91/spark2/spark2-hdp-yarn-archive.tar.gz
18/06/13 16:00:24 INFO Client: Uploading resource file:/home/myuser/.ivy2/jars/com.databricks_spark-avro_2.11-4.0.0.jar -> hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/com.databricks_spark-avro_2.11-4.0.0.jar
18/06/13 16:00:26 INFO Client: Uploading resource file:/home/myuser/.ivy2/jars/org.slf4j_slf4j-api-1.7.5.jar -> hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/org.slf4j_slf4j-api-1.7.5.jar
18/06/13 16:00:26 INFO Client: Uploading resource file:/home/myuser/.ivy2/jars/org.apache.avro_avro-1.7.6.jar -> hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/org.apache.avro_avro-1.7.6.jar
18/06/13 16:00:26 INFO Client: Uploading resource file:/home/myuser/.ivy2/jars/org.codehaus.jackson_jackson-core-asl-1.9.13.jar -> hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/org.codehaus.jackson_jackson-core-asl-1.9.13.jar
18/06/13 16:00:26 INFO Client: Uploading resource file:/home/myuser/.ivy2/jars/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar -> hdfs://myhost.myfirm.com:8020/user/myuser/.sparkStaging/application_1528901858967_0019/o
{noformat}

[jira] [Updated] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly

2017-11-09 Thread James Porritt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Porritt updated SPARK-22468:
--
Description: 
I have an issue whereby a subtract between two DataFrames that should correctly produce an empty DataFrame seemingly leaves the resulting DataFrame not initialised properly.

In my code I do the subtract both ways:

{code}x = a.subtract(b)
y = b.subtract(a){code}

I then call .rdd.isEmpty() on both results to check whether I need to do anything with them. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce, however; I can't seem to reduce it to a sample (a minimal sketch of the pattern follows the tracebacks below). One of the errors I will get is:

{noformat}File "", line 642, in 
if not y.rdd.isEmpty():
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1377, in isEmpty
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1343, in take
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 
992, in runJob
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2455, in _jrdd
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2390, in _wrap_function
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1386, in __call__
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1372, in _get_args
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py",
 line 501, in convert
AttributeError: 'NoneType' object has no attribute 'add'{noformat}

Another error is:

{noformat}File "", line 642, in 
if not y.rdd.isEmpty():
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in 
isEmpty
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take
  File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in 
runJob
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in 
_jrdd
  File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
line 1133, in __call__
  File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in 
deco
  File "/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o5751.asJavaRDD. Trace:
py4j.Py4JException: Method asJavaRDD([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745){noformat}

Another error is:
{noformat}
if not y.rdd.isEmpty():
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1377, in isEmpty
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
385, in getNumPartitions
AttributeError: 'NoneType' object has no attribute 'size'
{noformat}

This is happening at multiple points in my code.
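
For illustration, a minimal sketch of the pattern described above (the schema and data are invented, and the sketch does not deterministically trigger the failure, which appears to be intermittent):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("subtract-repro").getOrCreate()
a = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "val"])
b = spark.createDataFrame([(1, "x")], ["id", "val"])

x = a.subtract(b)  # non-empty: rows of a not present in b
y = b.subtract(a)  # correctly empty

# The reported failure: when x is non-empty, the second isEmpty() call
# intermittently raises one of the Py4J / 'NoneType' errors above.
print(x.rdd.isEmpty())
print(y.rdd.isEmpty())
{code}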





  was:
I have an issue whereby a subtract between two DataFrames that will correctly 
end up with an empty DataFrame, seemingly has the DataFrame not initialised 
properly.

In my code I try and do a subtract both ways:

{code}x = a.subtract(b)
y = b.subtract(a){code}

I then do an .rdd.isEmpty() on both of them to check if I need to do something 
with the results. Often the 'y' subtract will fail if the 'x' subtract is 
non-empty. It's hard to reproduce however, I can't seem to reduce it to a 
sample. One of the errors I will get is:

{noformat}File "", line 642, in 
if not y.rdd.isEmpty():
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1377, in isEmpty
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1343, in take
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 
992, in runJob
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2455, in _jrdd
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2390, in _wrap_function
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1386, in __call__
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1372, in _get_args
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py",
 line 501, in convert
AttributeError: 'NoneType' object has no attribute 'add'{noformat}

Another error is:

{noformat}File "", line 642, in 
if not y.rdd.isEmpty():
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in 
isEmpty
  

[jira] [Updated] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly

2017-11-08 Thread James Porritt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Porritt updated SPARK-22468:
--
Description: 
I have an issue whereby a subtract between two DataFrames that should correctly produce an empty DataFrame seemingly leaves the resulting DataFrame not initialised properly.

In my code I do the subtract both ways:

{code}x = a.subtract(b)
y = b.subtract(a){code}

I then call .rdd.isEmpty() on both results to check whether I need to do anything with them. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce, however; I can't seem to reduce it to a sample. One of the errors I will get is:

{noformat}File "", line 642, in 
if not y.rdd.isEmpty():
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1377, in isEmpty
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1343, in take
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 
992, in runJob
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2455, in _jrdd
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2390, in _wrap_function
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1386, in __call__
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1372, in _get_args
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py",
 line 501, in convert
AttributeError: 'NoneType' object has no attribute 'add'{noformat}

Another error is:

{noformat}File "", line 642, in 
if not y.rdd.isEmpty():
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in 
isEmpty
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take
  File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in 
runJob
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in 
_jrdd
  File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
line 1133, in __call__
  File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in 
deco
  File "/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o5751.asJavaRDD. Trace:
py4j.Py4JException: Method asJavaRDD([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745){noformat}

Sometimes the error will instead complain that a 'NoneType' object has no 'size' attribute.




  was:
I have an issue whereby a subtract between two DataFrames that will correctly 
end up with an empty DataFrame, seemingly has the DataFrame not initialised 
properly.

In my code I try and do a subtract both ways:

{code}x = a.subtract(b)
y = b.subtract(a){code}

I then do an .rdd.isEmpty() on both of them to check if I need to do something 
with the results. Often the 'y' subtract will fail if the 'x' subtract is 
non-empty. It's hard to reproduce however. One of the errors I will get is:

{noformat}File "", line 642, in 
if not y.rdd.isEmpty():
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1377, in isEmpty
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1343, in take
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 
992, in runJob
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2455, in _jrdd
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2390, in _wrap_function
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1386, in __call__
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1372, in _get_args
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py",
 line 501, in convert
AttributeError: 'NoneType' object has no attribute 'add'{noformat}

Another error is:

{noformat}File "", line 642, in 
if not y.rdd.isEmpty():
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in 
isEmpty
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take
  File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in 
runJob
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in 
_jrdd
  File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
line 1133, in __call__
  File 

[jira] [Updated] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly

2017-11-08 Thread James Porritt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Porritt updated SPARK-22468:
--
Description: 
I have an issue whereby a subtract between two DataFrames that should correctly produce an empty DataFrame seemingly leaves the resulting DataFrame not initialised properly.

In my code I do the subtract both ways:

{code}x = a.subtract(b)
y = b.subtract(a){code}

I then call .rdd.isEmpty() on both results to check whether I need to do anything with them. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce, however. One of the errors I will get is:

{noformat}File "", line 642, in 
if not y.rdd.isEmpty():
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1377, in isEmpty
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1343, in take
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 
992, in runJob
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2455, in _jrdd
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2390, in _wrap_function
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1386, in __call__
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1372, in _get_args
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py",
 line 501, in convert
AttributeError: 'NoneType' object has no attribute 'add'{noformat}

Another error is:

{noformat}File "", line 642, in 
if not y.rdd.isEmpty():
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in 
isEmpty
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take
  File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in 
runJob
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in 
_jrdd
  File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
line 1133, in __call__
  File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in 
deco
  File "/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o5751.asJavaRDD. Trace:
py4j.Py4JException: Method asJavaRDD([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745){noformat}

Sometimes the error will instead complain that a 'NoneType' object has no 'size' attribute.




  was:
I have an issue whereby a subtract between two DataFrames that will correctly 
end up with an empty DataFrame, seemingly has the DataFrame not initialised 
properly.

In my code I try and do a subtract both ways:

{code:python}x = a.subtract(b)
y = b.subtract(a){code}

I then do an .rdd.isEmpty() on both of them to check if I need to do something 
with the results. Often the 'y' subtract will fail if the 'x' subtract is 
non-empty. It's hard to reproduce however. One of the errors I will get is:

{noformat}File "", line 642, in 
if not y.rdd.isEmpty():
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1377, in isEmpty
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1343, in take
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 
992, in runJob
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2455, in _jrdd
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2390, in _wrap_function
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1386, in __call__
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1372, in _get_args
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py",
 line 501, in convert
AttributeError: 'NoneType' object has no attribute 'add'{noformat}

Another error is:

{noformat}File "", line 642, in 
if not y.rdd.isEmpty():
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in 
isEmpty
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take
  File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in 
runJob
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in 
_jrdd
  File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
line 1133, in __call__
  File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in 
deco
  File 

[jira] [Updated] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly

2017-11-08 Thread James Porritt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Porritt updated SPARK-22468:
--
Description: 
I have an issue whereby a subtract between two DataFrames that should correctly produce an empty DataFrame seemingly leaves the resulting DataFrame not initialised properly.

In my code I do the subtract both ways:

{{x = a.subtract(b)
y = b.subtract(a)}}

I then call .rdd.isEmpty() on both results to check whether I need to do anything with them. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce, however. One of the errors I will get is:

{noformat}File "", line 642, in 
if not y.rdd.isEmpty():
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1377, in isEmpty
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1343, in take
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 
992, in runJob
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2455, in _jrdd
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2390, in _wrap_function
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1386, in __call__
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1372, in _get_args
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py",
 line 501, in convert
AttributeError: 'NoneType' object has no attribute 'add'{noformat}

Another error is:

{noformat}File "", line 642, in 
if not y.rdd.isEmpty():
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in 
isEmpty
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take
  File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in 
runJob
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in 
_jrdd
  File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
line 1133, in __call__
  File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in 
deco
  File "/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o5751.asJavaRDD. Trace:
py4j.Py4JException: Method asJavaRDD([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745){noformat}

Sometimes the error will instead complain that a 'NoneType' object has no 'size' attribute.




  was:
I have an issue whereby a subtract between two DataFrames that will correctly 
end up with an empty DataFrame, seemingly has the DataFrame not initialised 
properly.

In my code I try and do a subtract both ways:

{{x = a.subtract(b)
y = b.subtract(a)}}

I then do an .rdd.isEmpty() on both of them to check if I need to do something 
with the results. Often the 'y' subtract will fail if the 'x' subtract is 
non-empty. It's hard to reproduce however. One of the errors I will get is:

{{  File "", line 642, in 
if not y.rdd.isEmpty():
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1377, in isEmpty
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1343, in take
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 
992, in runJob
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2455, in _jrdd
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2390, in _wrap_function
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1386, in __call__
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1372, in _get_args
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py",
 line 501, in convert
AttributeError: 'NoneType' object has no attribute 'add'}}

Another error is:

{{  File "", line 642, in 
if not y.rdd.isEmpty():
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in 
isEmpty
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take
  File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in 
runJob
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in 
_jrdd
  File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
line 1133, in __call__
  File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in 
deco
  File 

[jira] [Updated] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly

2017-11-08 Thread James Porritt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Porritt updated SPARK-22468:
--
Description: 
I have an issue whereby a subtract between two DataFrames that should correctly produce an empty DataFrame seemingly leaves the resulting DataFrame not initialised properly.

In my code I do the subtract both ways:

{code:python}x = a.subtract(b)
y = b.subtract(a){code}

I then call .rdd.isEmpty() on both results to check whether I need to do anything with them. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce, however. One of the errors I will get is:

{noformat}File "", line 642, in 
if not y.rdd.isEmpty():
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1377, in isEmpty
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1343, in take
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 
992, in runJob
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2455, in _jrdd
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2390, in _wrap_function
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1386, in __call__
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1372, in _get_args
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py",
 line 501, in convert
AttributeError: 'NoneType' object has no attribute 'add'{noformat}

Another error is:

{noformat}File "", line 642, in 
if not y.rdd.isEmpty():
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in 
isEmpty
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take
  File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in 
runJob
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in 
_jrdd
  File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
line 1133, in __call__
  File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in 
deco
  File "/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o5751.asJavaRDD. Trace:
py4j.Py4JException: Method asJavaRDD([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745){noformat}

Sometimes the error will instead complain that a 'NoneType' object has no 'size' attribute.




  was:
I have an issue whereby a subtract between two DataFrames that will correctly 
end up with an empty DataFrame, seemingly has the DataFrame not initialised 
properly.

In my code I try and do a subtract both ways:

{{x = a.subtract(b)
y = b.subtract(a)}}

I then do an .rdd.isEmpty() on both of them to check if I need to do something 
with the results. Often the 'y' subtract will fail if the 'x' subtract is 
non-empty. It's hard to reproduce however. One of the errors I will get is:

{noformat}File "", line 642, in 
if not y.rdd.isEmpty():
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1377, in isEmpty
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1343, in take
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 
992, in runJob
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2455, in _jrdd
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2390, in _wrap_function
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1386, in __call__
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1372, in _get_args
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py",
 line 501, in convert
AttributeError: 'NoneType' object has no attribute 'add'{noformat}

Another error is:

{noformat}File "", line 642, in 
if not y.rdd.isEmpty():
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in 
isEmpty
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take
  File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in 
runJob
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in 
_jrdd
  File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
line 1133, in __call__
  File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in 
deco
  File 

[jira] [Updated] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly

2017-11-08 Thread James Porritt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Porritt updated SPARK-22468:
--
Description: 
I have an issue whereby a subtract between two DataFrames that should correctly produce an empty DataFrame seemingly leaves the resulting DataFrame not initialised properly.

In my code I do the subtract both ways:

{{x = a.subtract(b)
y = b.subtract(a)}}

I then call .rdd.isEmpty() on both results to check whether I need to do anything with them. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce, however. One of the errors I will get is:

{{  File "", line 642, in 
if not y.rdd.isEmpty():
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1377, in isEmpty
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1343, in take
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 
992, in runJob
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2455, in _jrdd
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2390, in _wrap_function
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1386, in __call__
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1372, in _get_args
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py",
 line 501, in convert
AttributeError: 'NoneType' object has no attribute 'add'}}

Another error is:

{{  File "", line 642, in 
if not y.rdd.isEmpty():
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in 
isEmpty
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take
  File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in 
runJob
  File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in 
_jrdd
  File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
line 1133, in __call__
  File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in 
deco
  File "/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o5751.asJavaRDD. Trace:
py4j.Py4JException: Method asJavaRDD([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)}}

Sometimes the error will instead complain that a 'NoneType' object has no 'size' attribute.




  was:
I have an issue whereby a subtract between two DataFrames that will correctly 
end up with an empty DataFrame, seemingly has the DataFrame not initialised 
properly.

In my code I try and do a subtract both ways:

x = a.subtract(b)
y = b.subtract(a)

I then do an .rdd.isEmpty() on both of them to check if I need to do something 
with the results. Often the 'y' subtract will fail if the 'x' subtract is 
non-empty. It's hard to reproduce however. The error I will get is:

  File "", line 642, in 
if y.rdd.isEmpty():
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1377, in isEmpty
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1343, in take
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 
992, in runJob
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2455, in _jrdd
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2390, in _wrap_function
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1386, in __call__
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1372, in _get_args
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py",
 line 501, in convert
AttributeError: 'NoneType' object has no attribute 'add'

Sometimes the error will complain about it not having a 'size' parameter.




> subtract creating empty DataFrame that isn't initialised properly 
> --
>
> Key: SPARK-22468
> URL: https://issues.apache.org/jira/browse/SPARK-22468
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: James Porritt
>
> I have an issue whereby a subtract between two DataFrames that will correctly 
> end up with an empty DataFrame, seemingly has the 

[jira] [Created] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly

2017-11-08 Thread James Porritt (JIRA)
James Porritt created SPARK-22468:
-

 Summary: subtract creating empty DataFrame that isn't initialised 
properly 
 Key: SPARK-22468
 URL: https://issues.apache.org/jira/browse/SPARK-22468
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0
Reporter: James Porritt


I have an issue whereby a subtract between two DataFrames that should correctly produce an empty DataFrame seemingly leaves the resulting DataFrame not initialised properly.

In my code I do the subtract both ways:

x = a.subtract(b)
y = b.subtract(a)

I then call .rdd.isEmpty() on both results to check whether I need to do anything with them. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce, however. The error I will get is:

  File "", line 642, in 
if y.rdd.isEmpty():
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1377, in isEmpty
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
1343, in take
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 
992, in runJob
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2455, in _jrdd
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 
2390, in _wrap_function
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1386, in __call__
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1372, in _get_args
  File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py",
 line 501, in convert
AttributeError: 'NoneType' object has no attribute 'add'

Sometimes the error will instead complain that a 'NoneType' object has no 'size' attribute.








[jira] [Resolved] (SPARK-20809) PySpark: Java heap space issue despite apparently being within memory limits

2017-05-23 Thread James Porritt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Porritt resolved SPARK-20809.
---
Resolution: Fixed

Solution was to specify --driver-memory on the command line.
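
For context, spark.driver.memory has to be set before the driver JVM starts, so a SparkConf built inside an already-running script comes too late. A hedged sketch of the programmatic equivalent, assuming the script is launched with plain python rather than through spark-submit flags:

{code:python}
import os

# Must be set before the SparkContext (and its JVM) is created;
# 'pyspark-shell' is the required trailing token.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 16g pyspark-shell"

from pyspark import SparkContext, SparkConf

sc = SparkContext(appName="testRDD", conf=SparkConf())
{code}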

> PySpark: Java heap space issue despite apparently being within memory limits
> 
>
> Key: SPARK-20809
> URL: https://issues.apache.org/jira/browse/SPARK-20809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Linux x86_64
>Reporter: James Porritt
>
> I have the following script:
> {code}
> import itertools
> import loremipsum
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession
> conf = SparkConf().set("spark.cores.max", "16") \
> .set("spark.driver.memory", "16g") \
> .set("spark.executor.memory", "16g") \
> .set("spark.executor.memory_overhead", "16g") \
> .set("spark.driver.maxResultsSize", "0")
> sc = SparkContext(appName="testRDD", conf=conf)
> ss = SparkSession(sc)
> j = itertools.cycle(range(8))
> rows = [(i, j.next(), ' '.join(map(lambda x: x[2], 
> loremipsum.generate_sentences(600)))) for i in range(500)] * 100
> rrd = sc.parallelize(rows, 128)
> {code}
> When I run it with:
> {noformat}
> /spark-2.1.1-bin-hadoop2.7/bin/spark-submit <script directory>/writeTest.py
> {noformat}
> it fails with a 'Java heap space' error:
> {noformat}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.readRDDFromFile.
> : java.lang.OutOfMemoryError: Java heap space
> at 
> org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:468)
> at 
> org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:214)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The data I create here approximates my actual data. The third element of each 
> tuple should be around 25k, and there are 50k tuples overall. I estimate that 
> I should have around 1.2G of data. 
> Why then does it fail? All parts of the system should have enough memory.






[jira] [Commented] (SPARK-20809) PySpark: Java heap space issue despite apparently being within memory limits

2017-05-23 Thread James Porritt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16020744#comment-16020744
 ] 

James Porritt commented on SPARK-20809:
---

Many thanks, this put me on track for the solution. I needed to pass --driver-memory=16g on the command line rather than set it in the code.

I'd done some tests on the sentence generator and worked out how to get it to 
give me a 25K string, which multiplied by 50,000 is about 1.2G.
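
As a quick check of that estimate (illustrative arithmetic only):

{code:python}
per_row = 25 * 1024   # ~25 KB of text per tuple
n_rows = 500 * 100    # 50,000 tuples
print(per_row * n_rows / 2.0**30)  # ~1.19 GiB, i.e. "about 1.2G"
{code}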

> PySpark: Java heap space issue despite apparently being within memory limits
> 
>
> Key: SPARK-20809
> URL: https://issues.apache.org/jira/browse/SPARK-20809
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.1
> Environment: Linux x86_64
>Reporter: James Porritt
>
> I have the following script:
> {code}
> import itertools
> import loremipsum
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession
> conf = SparkConf().set("spark.cores.max", "16") \
> .set("spark.driver.memory", "16g") \
> .set("spark.executor.memory", "16g") \
> .set("spark.executor.memory_overhead", "16g") \
> .set("spark.driver.maxResultsSize", "0")
> sc = SparkContext(appName="testRDD", conf=conf)
> ss = SparkSession(sc)
> j = itertools.cycle(range(8))
> rows = [(i, j.next(), ' '.join(map(lambda x: x[2], 
> loremipsum.generate_sentences(600)))) for i in range(500)] * 100
> rrd = sc.parallelize(rows, 128)
> {code}
> When I run it with:
> {noformat}
> /spark-2.1.1-bin-hadoop2.7/bin/spark-submit <script directory>/writeTest.py
> {noformat}
> it fails with a 'Java heap space' error:
> {noformat}
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.readRDDFromFile.
> : java.lang.OutOfMemoryError: Java heap space
> at 
> org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:468)
> at 
> org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:214)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The data I create here approximates my actual data. The third element of each 
> tuple should be around 25k, and there are 50k tuples overall. I estimate that 
> I should have around 1.2G of data. 
> Why then does it fail? All parts of the system should have enough memory.






[jira] [Created] (SPARK-20809) PySpark: Java heap space issue despite apparently being within memory limits

2017-05-19 Thread James Porritt (JIRA)
James Porritt created SPARK-20809:
-

 Summary: PySpark: Java heap space issue despite apparently being 
within memory limits
 Key: SPARK-20809
 URL: https://issues.apache.org/jira/browse/SPARK-20809
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.1
 Environment: Linux x86_64
Reporter: James Porritt


I have the following script:

{code}
import itertools
import loremipsum
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().set("spark.cores.max", "16") \
.set("spark.driver.memory", "16g") \
.set("spark.executor.memory", "16g") \
.set("spark.executor.memory_overhead", "16g") \
.set("spark.driver.maxResultsSize", "0")

sc = SparkContext(appName="testRDD", conf=conf)
ss = SparkSession(sc)

j = itertools.cycle(range(8))
rows = [(i, j.next(), ' '.join(map(lambda x: x[2], 
loremipsum.generate_sentences(600)))) for i in range(500)] * 100
rrd = sc.parallelize(rows, 128)
{code}

When I run it with:
{noformat}
/spark-2.1.1-bin-hadoop2.7/bin/spark-submit <script directory>/writeTest.py
{noformat}

it fails with a 'Java heap space' error:

{noformat}
py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.readRDDFromFile.
: java.lang.OutOfMemoryError: Java heap space
at 
org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:468)
at 
org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
{noformat}

The data I create here approximates my actual data. The third element of each 
tuple should be around 25k, and there are 50k tuples overall. I estimate that I 
should have around 1.2G of data. 

Why then does it fail? All parts of the system should have enough memory.


