[jira] [Updated] (SPARK-26113) TypeError: object of type 'NoneType' has no len() in authenticate_and_accum_updates of pyspark/accumulators.py

2018-11-19 Thread Sai Varun Reddy Daram (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sai Varun Reddy Daram updated SPARK-26113:
--
Priority: Critical  (was: Major)

> TypeError: object of type 'NoneType' has no len() in 
> authenticate_and_accum_updates of pyspark/accumulators.py
> --
>
> Key: SPARK-26113
> URL: https://issues.apache.org/jira/browse/SPARK-26113
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.0
>Reporter: Sai Varun Reddy Daram
>Priority: Critical
>
> Machine OS: Ubuntu 16.04.
> Kubernetes: Minikube 
> Kubernetes Version: 1.10.0
> Spark Kubernetes Image: pyspark ( at docker hub: saivarunr/spark-py:2.4 ) 
> built using standard spark docker build.sh file.
> Steps to replicate:
> 1) Create a spark Session:  
> {code:java}
> // 
> spark_session=SparkSession.builder.master('k8s://https://192.168.99.100:8443').config('spark.executor.instances','1').config('spark.kubernetes.container.image','saivarunr/spark-py:2.4').getOrCreate()
> {code}
>  2) Create a sample DataFrame
> {code:java}
> // df=spark_session.createDataFrame([{'a':1}])
> {code}
>  3) Do some operation on this dataframe
> {code:java}
> // df.count(){code}
> I get this output.
> {code:java}
> // Exception happened during processing of request from ('127.0.0.1', 38690)
> Traceback (most recent call last):
> File "/usr/lib/python3.6/socketserver.py", line 317, in 
> _handle_request_noblock
> self.process_request(request, client_address)
> File "/usr/lib/python3.6/socketserver.py", line 348, in process_request
> self.finish_request(request, client_address)
> File "/usr/lib/python3.6/socketserver.py", line 361, in finish_request
> self.RequestHandlerClass(request, client_address, self)
> File "/usr/lib/python3.6/socketserver.py", line 721, in __init__
> self.handle()
> File 
> "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", 
> line 266, in handle
> poll(authenticate_and_accum_updates)
> File 
> "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", 
> line 241, in poll
> if func():
> File 
> "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", 
> line 254, in authenticate_and_accum_updates
> received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
> {code}
> 4) Repeat the above step; the error does not appear again.
>  
> But now close the session, kill the Python terminal or process, and try 
> again: the same error happens.
>  
> Is this related to https://issues.apache.org/jira/browse/SPARK-26019?
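For context on the trace above: the failure comes from the small accumulator-update server that PySpark starts on the driver. Its handler reads back an authentication token before applying accumulator updates, and when the server was started without a token the call boils down to len(None). The sketch below is only an illustration of that pattern (function and variable names are simplified, not the exact pyspark/accumulators.py source), showing why the TypeError appears and how a None-guard would avoid it:
{code:python}
# Simplified illustration of the failing pattern; not the actual Spark source.
def authenticate_and_accum_updates(rfile, auth_token):
    # If the update server was started without a token, auth_token is None and
    # len(None) raises: TypeError: object of type 'NoneType' has no len()
    received_token = rfile.read(len(auth_token))
    return received_token == auth_token


def authenticate_with_guard(rfile, auth_token):
    # Hypothetical defensive variant: only perform the handshake when a token
    # is actually configured.
    if auth_token is None:
        return True
    received_token = rfile.read(len(auth_token))
    if isinstance(received_token, bytes):
        received_token = received_token.decode("utf-8")
    return received_token == auth_token
{code}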






[jira] [Updated] (SPARK-26113) TypeError: object of type 'NoneType' has no len() in authenticate_and_accum_updates of pyspark/accumulators.py

2018-11-19 Thread Sai Varun Reddy Daram (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sai Varun Reddy Daram updated SPARK-26113:
--
Priority: Major  (was: Critical)

> TypeError: object of type 'NoneType' has no len() in 
> authenticate_and_accum_updates of pyspark/accumulators.py
> --
>
> Key: SPARK-26113
> URL: https://issues.apache.org/jira/browse/SPARK-26113
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.0
>Reporter: Sai Varun Reddy Daram
>Priority: Major
>
> Machine OS: Ubuntu 16.04.
> Kubernetes: Minikube 
> Kubernetes Version: 1.10.0
> Spark Kubernetes Image: pyspark ( at docker hub: saivarunr/spark-py:2.4 ) 
> built using standard spark docker build.sh file.
> Steps to replicate:
> 1) Create a spark Session:  
> {code:java}
> // 
> spark_session=SparkSession.builder.master('k8s://https://192.168.99.100:8443').config('spark.executor.instances','1').config('spark.kubernetes.container.image','saivarunr/spark-py:2.4').getOrCreate()
> {code}
>  2) Create a sample DataFrame
> {code:java}
> // df=spark_session.createDataFrame([{'a':1}])
> {code}
>  3) Do some operation on this dataframe
> {code:java}
> // df.count(){code}
> I get this output.
> {code:java}
> // Exception happened during processing of request from ('127.0.0.1', 38690)
> Traceback (most recent call last):
> File "/usr/lib/python3.6/socketserver.py", line 317, in 
> _handle_request_noblock
> self.process_request(request, client_address)
> File "/usr/lib/python3.6/socketserver.py", line 348, in process_request
> self.finish_request(request, client_address)
> File "/usr/lib/python3.6/socketserver.py", line 361, in finish_request
> self.RequestHandlerClass(request, client_address, self)
> File "/usr/lib/python3.6/socketserver.py", line 721, in __init__
> self.handle()
> File 
> "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", 
> line 266, in handle
> poll(authenticate_and_accum_updates)
> File 
> "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", 
> line 241, in poll
> if func():
> File 
> "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", 
> line 254, in authenticate_and_accum_updates
> received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
> {code}
> 4) Repeat the above step; the error does not appear again.
>  
> But now close the session, kill the Python terminal or process, and try 
> again: the same error happens.
>  
> Is this related to https://issues.apache.org/jira/browse/SPARK-26019?






[jira] [Commented] (SPARK-26113) TypeError: object of type 'NoneType' has no len() in authenticate_and_accum_updates of pyspark/accumulators.py

2018-11-19 Thread Sai Varun Reddy Daram (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691381#comment-16691381
 ] 

Sai Varun Reddy Daram commented on SPARK-26113:
---

[~hyukjin.kwon] sorry for that, I did not know.

> TypeError: object of type 'NoneType' has no len() in 
> authenticate_and_accum_updates of pyspark/accumulators.py
> --
>
> Key: SPARK-26113
> URL: https://issues.apache.org/jira/browse/SPARK-26113
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.0
>Reporter: Sai Varun Reddy Daram
>Priority: Major
>
> Machine OS: Ubuntu 16.04.
> Kubernetes: Minikube 
> Kubernetes Version: 1.10.0
> Spark Kubernetes Image: pyspark ( at docker hub: saivarunr/spark-py:2.4 ) 
> built using standard spark docker build.sh file.
> Steps to replicate:
> 1) Create a spark Session:  
> {code:java}
> // 
> spark_session=SparkSession.builder.master('k8s://https://192.168.99.100:8443').config('spark.executor.instances','1').config('spark.kubernetes.container.image','saivarunr/spark-py:2.4').getOrCreate()
> {code}
>  2) Create a sample DataFrame
> {code:java}
> // df=spark_session.createDataFrame([{'a':1}])
> {code}
>  3) Do some operation on this dataframe
> {code:java}
> // df.count(){code}
> I get this output.
> {code:java}
> // Exception happened during processing of request from ('127.0.0.1', 38690)
> Traceback (most recent call last):
> File "/usr/lib/python3.6/socketserver.py", line 317, in 
> _handle_request_noblock
> self.process_request(request, client_address)
> File "/usr/lib/python3.6/socketserver.py", line 348, in process_request
> self.finish_request(request, client_address)
> File "/usr/lib/python3.6/socketserver.py", line 361, in finish_request
> self.RequestHandlerClass(request, client_address, self)
> File "/usr/lib/python3.6/socketserver.py", line 721, in __init__
> self.handle()
> File 
> "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", 
> line 266, in handle
> poll(authenticate_and_accum_updates)
> File 
> "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", 
> line 241, in poll
> if func():
> File 
> "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", 
> line 254, in authenticate_and_accum_updates
> received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
> {code}
> 4) Repeat the above step; the error does not appear again.
>  
> But now close the session, kill the Python terminal or process, and try 
> again: the same error happens.
>  
> Is this related to https://issues.apache.org/jira/browse/SPARK-26019?






[jira] [Updated] (SPARK-26113) TypeError: object of type 'NoneType' has no len() in authenticate_and_accum_updates of pyspark/accumulators.py

2018-11-19 Thread Sai Varun Reddy Daram (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sai Varun Reddy Daram updated SPARK-26113:
--
Description: 
Machine OS: Ubuntu 16.04.

Kubernetes: Minikube 

Kubernetes Version: 1.10.0

Spark Kubernetes Image: pyspark ( at docker hub: saivarunr/spark-py:2.4 ) built 
using standard spark docker build.sh file.

Driver is inside pod in kubernetes cluster.

Steps to replicate:

1) Create a spark Session:  
{code:java}
// 
spark_session=SparkSession.builder.master('k8s://https://192.168.99.100:8443').config('spark.executor.instances','1').config('spark.kubernetes.container.image','saivarunr/spark-py:2.4').getOrCreate()
{code}
 2) Create a sample DataFrame
{code:java}
// df=spark_session.createDataFrame([{'a':1}])
{code}
 3) Do some operation on this dataframe
{code:java}
// df.count(){code}
I get this output.
{code:java}
// Exception happened during processing of request from ('127.0.0.1', 38690)
Traceback (most recent call last):
File "/usr/lib/python3.6/socketserver.py", line 317, in _handle_request_noblock
self.process_request(request, client_address)
File "/usr/lib/python3.6/socketserver.py", line 348, in process_request
self.finish_request(request, client_address)
File "/usr/lib/python3.6/socketserver.py", line 361, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/usr/lib/python3.6/socketserver.py", line 721, in __init__
self.handle()
File 
"/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", 
line 266, in handle
poll(authenticate_and_accum_updates)
File 
"/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", 
line 241, in poll
if func():
File 
"/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", 
line 254, in authenticate_and_accum_updates
received_token = self.rfile.read(len(auth_token))
TypeError: object of type 'NoneType' has no len()

{code}
4) Repeat the above step; the error does not appear again.

 

But now close the session, kill the Python terminal or process, and try again: 
the same error happens.

 

Is this related to https://issues.apache.org/jira/browse/SPARK-26019?

  was:
Machine OS: Ubuntu 16.04.

Kubernetes: Minikube 

Kubernetes Version: 1.10.0

Spark Kubernetes Image: pyspark ( at docker hub: saivarunr/spark-py:2.4 ) built 
using standard spark docker build.sh file.

Steps to replicate:

1) Create a spark Session:  
{code:java}
// 
spark_session=SparkSession.builder.master('k8s://https://192.168.99.100:8443').config('spark.executor.instances','1').config('spark.kubernetes.container.image','saivarunr/spark-py:2.4').getOrCreate()
{code}
 2) Create a sample DataFrame
{code:java}
// df=spark_session.createDataFrame([{'a':1}])
{code}
 3) Do some operation on this dataframe
{code:java}
// df.count(){code}
I get this output.
{code:java}
// Exception happened during processing of request from ('127.0.0.1', 38690)
Traceback (most recent call last):
File "/usr/lib/python3.6/socketserver.py", line 317, in _handle_request_noblock
self.process_request(request, client_address)
File "/usr/lib/python3.6/socketserver.py", line 348, in process_request
self.finish_request(request, client_address)
File "/usr/lib/python3.6/socketserver.py", line 361, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/usr/lib/python3.6/socketserver.py", line 721, in __init__
self.handle()
File 
"/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", 
line 266, in handle
poll(authenticate_and_accum_updates)
File 
"/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", 
line 241, in poll
if func():
File 
"/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", 
line 254, in authenticate_and_accum_updates
received_token = self.rfile.read(len(auth_token))
TypeError: object of type 'NoneType' has no len()

{code}
4) Repeat the above step; the error does not appear again.

 

But now close the session, kill the Python terminal or process, and try again: 
the same error happens.

 

Is this related to https://issues.apache.org/jira/browse/SPARK-26019?


> TypeError: object of type 'NoneType' has no len() in 
> authenticate_and_accum_updates of pyspark/accumulators.py
> --
>
> Key: SPARK-26113
> URL: https://issues.apache.org/jira/browse/SPARK-26113
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.0
>Reporter: Sai Varun Reddy Daram
>Priority: Major
>
> Machine OS: Ubuntu 16.04.
> Kubernetes: Minikube 
> Kubernetes Version: 1.10.0
> Spark Kubernetes Image: pyspark ( at docker hub: saivarunr/spark-py:2.4 ) 
> built using standard spark docker build.sh file.
> Driver is inside pod in kubernetes cluster.
> Step

[jira] [Created] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions

2018-11-19 Thread Sergey Zhemzhitsky (JIRA)
Sergey Zhemzhitsky created SPARK-26114:
--

 Summary: Memory leak of PartitionedPairBuffer when coalescing 
after repartitionAndSortWithinPartitions
 Key: SPARK-26114
 URL: https://issues.apache.org/jira/browse/SPARK-26114
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0, 2.3.2, 2.2.2
 Environment: Spark 3.0.0-SNAPSHOT (master branch)
Scala 2.11
Yarn 2.7
Reporter: Sergey Zhemzhitsky


Trying to use _coalesce_ after shuffle-oriented transformations leads to 
OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X 
GB of Y GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead_

The error happens when trying to specify a pretty small number of partitions 
in the _coalesce_ call.

*How to reproduce?*

# Start spark-shell
{code:bash}
spark-shell \ 
  --num-executors=5 \ 
  --executor-cores=2 \ 
  --master=yarn \
  --deploy-mode=client \ 
  --conf spark.executor.memory=1g \ 
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
{code}
Please note the use of the _-Dio.netty.noUnsafe=true_ property. Preventing off-heap 
memory usage currently seems to be the only way to control the amount of memory 
used for transferring shuffle data.
Also note that the total number of cores allocated for the job is 5x2=10.
# Then generate some test data
{code:scala}
import org.apache.hadoop.io._ 
import org.apache.hadoop.io.compress._ 
import org.apache.commons.lang._ 
import org.apache.spark._ 

// generate 100M records of sample data 
sc.makeRDD(1 to 1000, 1000) 
  .flatMap(item => (1 to 10) 
    .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> 
      new Text(RandomStringUtils.randomAlphanumeric(1024)))) 
  .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) 
{code}
# Run the sample job
{code:scala}
import org.apache.hadoop.io._
import org.apache.spark._
import org.apache.spark.storage._

val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
rdd 
  .map(item => item._1.toString -> item._2.toString) 
  .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) 
  .coalesce(10,false) 
  .count 
{code}
Note that the number of partitions is equal to the total number of cores 
allocated to the job.
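For context on why the partition count matters here: a coalesce without shuffling builds a narrow dependency, so each of the 10 result tasks computes roughly 100 of the 1000 sorted shuffle partitions inside a single task, and each of those partitions appears to bring its own sorter buffer (consistent with the attached heap dumps). A rough PySpark sketch of the same shape follows; the input path and key derivation are placeholders, and the shuffle=True variant is shown only as a possible trade-off, not as the fix for this issue.
{code:python}
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Placeholder input and key extraction; any key-value RDD shows the same shape.
pairs = sc.textFile("/tmp/random-strings").map(lambda line: (line[:3], line))

sorted_rdd = pairs.repartitionAndSortWithinPartitions(1000)

# Narrow coalesce, as in the Scala job above: 10 tasks, each merging ~100 of
# the sorted shuffle partitions, which concentrates sorter memory per task.
narrow = sorted_rdd.coalesce(10)

# With an extra shuffle every task materialises only its own output partition,
# trading additional network traffic for lower per-task memory pressure.
with_shuffle = sorted_rdd.coalesce(10, shuffle=True)

print(narrow.getNumPartitions(), with_shuffle.getNumPartitions())
{code}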









[jira] [Updated] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions

2018-11-19 Thread Sergey Zhemzhitsky (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Zhemzhitsky updated SPARK-26114:
---
Attachment: run1-noparams-dominator-tree.png

> Memory leak of PartitionedPairBuffer when coalescing after 
> repartitionAndSortWithinPartitions
> -
>
> Key: SPARK-26114
> URL: https://issues.apache.org/jira/browse/SPARK-26114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
> Environment: Spark 3.0.0-SNAPSHOT (master branch)
> Scala 2.11
> Yarn 2.7
>Reporter: Sergey Zhemzhitsky
>Priority: Major
> Attachments: run1-noparams-dominator-tree.png
>
>
> Trying to use _coalesce_ after shuffle-oriented transformations leads to 
> OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X 
> GB of Y GB physical memory used. Consider boosting 
> spark.yarn.executor.memoryOverhead_
> The error happens when trying to specify a pretty small number of partitions 
> in the _coalesce_ call.
> *How to reproduce?*
> # Start spark-shell
> {code:bash}
> spark-shell \ 
>   --num-executors=5 \ 
>   --executor-cores=2 \ 
>   --master=yarn \
>   --deploy-mode=client \ 
>   --conf spark.executor.memory=1g \ 
>   --conf spark.dynamicAllocation.enabled=false \
>   --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError 
> -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
> {code}
> Please note the use of the _-Dio.netty.noUnsafe=true_ property. Preventing off-heap 
> memory usage currently seems to be the only way to control the amount of memory 
> used for transferring shuffle data.
> Also note that the total number of cores allocated for the job is 5x2=10.
> # Then generate some test data
> {code:scala}
> import org.apache.hadoop.io._ 
> import org.apache.hadoop.io.compress._ 
> import org.apache.commons.lang._ 
> import org.apache.spark._ 
> // generate 100M records of sample data 
> sc.makeRDD(1 to 1000, 1000) 
>   .flatMap(item => (1 to 10) 
>     .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> 
>       new Text(RandomStringUtils.randomAlphanumeric(1024)))) 
>   .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) 
> {code}
> # Run the sample job
> {code:scala}
> import org.apache.hadoop.io._
> import org.apache.spark._
> import org.apache.spark.storage._
> val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
> rdd 
>   .map(item => item._1.toString -> item._2.toString) 
>   .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) 
>   .coalesce(10,false) 
>   .count 
> {code}
> Note that the number of partitions is equal to the total number of cores 
> allocated to the job.






[jira] [Updated] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions

2018-11-19 Thread Sergey Zhemzhitsky (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Zhemzhitsky updated SPARK-26114:
---
Attachment: run1-noparams-dominator-tree-externalsorter.png

> Memory leak of PartitionedPairBuffer when coalescing after 
> repartitionAndSortWithinPartitions
> -
>
> Key: SPARK-26114
> URL: https://issues.apache.org/jira/browse/SPARK-26114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
> Environment: Spark 3.0.0-SNAPSHOT (master branch)
> Scala 2.11
> Yarn 2.7
>Reporter: Sergey Zhemzhitsky
>Priority: Major
> Attachments: run1-noparams-dominator-tree-externalsorter.png, 
> run1-noparams-dominator-tree.png
>
>
> Trying to use _coalesce_ after shuffle-oriented transformations leads to 
> OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X 
> GB of Y GB physical memory used. Consider boosting 
> spark.yarn.executor.memoryOverhead_
> The error happens when trying to specify a pretty small number of partitions 
> in the _coalesce_ call.
> *How to reproduce?*
> # Start spark-shell
> {code:bash}
> spark-shell \ 
>   --num-executors=5 \ 
>   --executor-cores=2 \ 
>   --master=yarn \
>   --deploy-mode=client \ 
>   --conf spark.executor.memory=1g \ 
>   --conf spark.dynamicAllocation.enabled=false \
>   --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError 
> -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
> {code}
> Please note the use of the _-Dio.netty.noUnsafe=true_ property. Preventing off-heap 
> memory usage currently seems to be the only way to control the amount of memory 
> used for transferring shuffle data.
> Also note that the total number of cores allocated for the job is 5x2=10.
> # Then generate some test data
> {code:scala}
> import org.apache.hadoop.io._ 
> import org.apache.hadoop.io.compress._ 
> import org.apache.commons.lang._ 
> import org.apache.spark._ 
> // generate 100M records of sample data 
> sc.makeRDD(1 to 1000, 1000) 
>   .flatMap(item => (1 to 10) 
>     .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> 
>       new Text(RandomStringUtils.randomAlphanumeric(1024)))) 
>   .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) 
> {code}
> # Run the sample job
> {code:scala}
> import org.apache.hadoop.io._
> import org.apache.spark._
> import org.apache.spark.storage._
> val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
> rdd 
>   .map(item => item._1.toString -> item._2.toString) 
>   .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) 
>   .coalesce(10,false) 
>   .count 
> {code}
> Note that the number of partitions is equal to the total number of cores 
> allocated to the job.






[jira] [Comment Edited] (SPARK-26113) TypeError: object of type 'NoneType' has no len() in authenticate_and_accum_updates of pyspark/accumulators.py

2018-11-19 Thread Sai Varun Reddy Daram (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691330#comment-16691330
 ] 

Sai Varun Reddy Daram edited comment on SPARK-26113 at 11/19/18 9:33 AM:
-

Something to help here: https://issues.apache.org/jira/browse/SPARK-26019 ?


was (Author: saivarunvishal):
Something to help here: https://issues.apache.org/jira/browse/SPARK-26113 ?

> TypeError: object of type 'NoneType' has no len() in 
> authenticate_and_accum_updates of pyspark/accumulators.py
> --
>
> Key: SPARK-26113
> URL: https://issues.apache.org/jira/browse/SPARK-26113
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.0
>Reporter: Sai Varun Reddy Daram
>Priority: Major
>
> Machine OS: Ubuntu 16.04.
> Kubernetes: Minikube 
> Kubernetes Version: 1.10.0
> Spark Kubernetes Image: pyspark ( at docker hub: saivarunr/spark-py:2.4 ) 
> built using standard spark docker build.sh file.
> Driver is inside pod in kubernetes cluster.
> Steps to replicate:
> 1) Create a spark Session:  
> {code:java}
> // 
> spark_session=SparkSession.builder.master('k8s://https://192.168.99.100:8443').config('spark.executor.instances','1').config('spark.kubernetes.container.image','saivarunr/spark-py:2.4').getOrCreate()
> {code}
>  2) Create a sample DataFrame
> {code:java}
> // df=spark_session.createDataFrame([{'a':1}])
> {code}
>  3) Do some operation on this dataframe
> {code:java}
> // df.count(){code}
> I get this output.
> {code:java}
> // Exception happened during processing of request from ('127.0.0.1', 38690)
> Traceback (most recent call last):
> File "/usr/lib/python3.6/socketserver.py", line 317, in 
> _handle_request_noblock
> self.process_request(request, client_address)
> File "/usr/lib/python3.6/socketserver.py", line 348, in process_request
> self.finish_request(request, client_address)
> File "/usr/lib/python3.6/socketserver.py", line 361, in finish_request
> self.RequestHandlerClass(request, client_address, self)
> File "/usr/lib/python3.6/socketserver.py", line 721, in __init__
> self.handle()
> File 
> "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", 
> line 266, in handle
> poll(authenticate_and_accum_updates)
> File 
> "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", 
> line 241, in poll
> if func():
> File 
> "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", 
> line 254, in authenticate_and_accum_updates
> received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
> {code}
> 4) Repeat the above step; the error does not appear again.
>  
> But now close the session, kill the Python terminal or process, and try 
> again: the same error happens.
>  
> Is this related to https://issues.apache.org/jira/browse/SPARK-26019?






[jira] [Issue Comment Deleted] (SPARK-20236) Overwrite a partitioned data source table should only overwrite related partitions

2018-11-19 Thread Mikalai Surta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikalai Surta updated SPARK-20236:
--
Comment: was deleted

(was: Could you please explain why it doesn't work as I expect in my case?

I create a table with dynamic partitions and insert data into 2 partitions. Then 
I insert overwrite data for 1 partition. I expect to see 2 partitions as a result, 
but got only 1, because the overwrite replaced the whole table. This doesn't seem 
to be the same case as [~deepanker] had, since I create the table with an explicit 
PARTITIONED BY clause.

spark.sparkContext.getConf().getAll()

...

('spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")', ''),

...

sqlContext.sql("CREATE TABLE `debug2` () PARTITIONED BY (inputDate)")

sqlContext.sql("insert into debug2 select * from debug1")

sqlContext.sql("select * from debug2").select('ean', 'inputDate').show(10, 
False)

+-------------+----------+
|ean          |inputDate |
+-------------+----------+
|4019238159363|*20181025*|
|3188642344151|*20181026*|
+-------------+----------+

sqlContext.sql("insert overwrite table debug2 select * from debug1 where 
inputDate=='*20181025*'")

+-------------+----------+
|ean          |inputDate |
+-------------+----------+
|4019238159363|*20181025*|
+-------------+----------+

*20181026* record is lost.)

> Overwrite a partitioned data source table should only overwrite related 
> partitions
> --
>
> Key: SPARK-20236
> URL: https://issues.apache.org/jira/browse/SPARK-20236
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: releasenotes
> Fix For: 2.3.0
>
>
> When we overwrite a partitioned data source table, currently Spark will 
> truncate the entire table to write new data, or truncate a bunch of 
> partitions according to the given static partitions.
> For example, {{INSERT OVERWRITE tbl ...}} will truncate the entire table, and 
> {{INSERT OVERWRITE tbl PARTITION (a=1, b)}} will truncate all the partitions 
> that start with {{a=1}}.
> This behavior is somewhat reasonable, as we can know which partitions will be 
> overwritten before runtime. However, Hive behaves differently: it 
> only overwrites related partitions, e.g. {{INSERT OVERWRITE tbl SELECT 
> 1,2,3}} will only overwrite partition {{a=2, b=3}}, assuming {{tbl}} has only 
> one data column and is partitioned by {{a}} and {{b}}.
> It seems better if we can follow Hive's behavior.
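For reference, the Hive-style behavior described here is controlled in Spark 2.3+ by the {{spark.sql.sources.partitionOverwriteMode}} session setting (the same setting referenced in the deleted comment above). A minimal sketch, with made-up table and column names and assuming an active SparkSession named spark:
{code:python}
# Minimal sketch; table/column names are illustrative (Spark 2.3+).
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

spark.sql("CREATE TABLE tbl (value INT, a INT, b INT) USING parquet PARTITIONED BY (a, b)")
spark.sql("INSERT INTO tbl VALUES (0, 1, 1), (0, 2, 3)")

# In dynamic mode only partition (a=2, b=3) is rewritten;
# partition (a=1, b=1) keeps its existing data.
spark.sql("INSERT OVERWRITE TABLE tbl SELECT 1, 2, 3")
{code}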






[jira] [Commented] (SPARK-20236) Overwrite a partitioned data source table should only overwrite related partitions

2018-11-19 Thread Mikalai Surta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691472#comment-16691472
 ] 

Mikalai Surta commented on SPARK-20236:
---

[~cloud_fan] It was my fault; it works fine. I deleted my comment as invalid.

> Overwrite a partitioned data source table should only overwrite related 
> partitions
> --
>
> Key: SPARK-20236
> URL: https://issues.apache.org/jira/browse/SPARK-20236
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: releasenotes
> Fix For: 2.3.0
>
>
> When we overwrite a partitioned data source table, currently Spark will 
> truncate the entire table to write new data, or truncate a bunch of 
> partitions according to the given static partitions.
> For example, {{INSERT OVERWRITE tbl ...}} will truncate the entire table, and 
> {{INSERT OVERWRITE tbl PARTITION (a=1, b)}} will truncate all the partitions 
> that start with {{a=1}}.
> This behavior is somewhat reasonable, as we can know which partitions will be 
> overwritten before runtime. However, Hive behaves differently: it 
> only overwrites related partitions, e.g. {{INSERT OVERWRITE tbl SELECT 
> 1,2,3}} will only overwrite partition {{a=2, b=3}}, assuming {{tbl}} has only 
> one data column and is partitioned by {{a}} and {{b}}.
> It seems better if we can follow Hive's behavior.






[jira] [Updated] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions

2018-11-19 Thread Sergey Zhemzhitsky (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Zhemzhitsky updated SPARK-26114:
---
Attachment: run1-noparams-dominator-tree-externalsorter-gc-root.png

> Memory leak of PartitionedPairBuffer when coalescing after 
> repartitionAndSortWithinPartitions
> -
>
> Key: SPARK-26114
> URL: https://issues.apache.org/jira/browse/SPARK-26114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
> Environment: Spark 3.0.0-SNAPSHOT (master branch)
> Scala 2.11
> Yarn 2.7
>Reporter: Sergey Zhemzhitsky
>Priority: Major
> Attachments: run1-noparams-dominator-tree-externalsorter-gc-root.png, 
> run1-noparams-dominator-tree-externalsorter.png, 
> run1-noparams-dominator-tree.png
>
>
> Trying to use _coalesce_ after shuffle-oriented transformations leads to 
> OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X 
> GB of Y GB physical memory used. Consider boosting 
> spark.yarn.executor.memoryOverhead_
> The error happens when trying to specify a pretty small number of partitions 
> in the _coalesce_ call.
> *How to reproduce?*
> # Start spark-shell
> {code:bash}
> spark-shell \ 
>   --num-executors=5 \ 
>   --executor-cores=2 \ 
>   --master=yarn \
>   --deploy-mode=client \ 
>   --conf spark.executor.memory=1g \ 
>   --conf spark.dynamicAllocation.enabled=false \
>   --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError 
> -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
> {code}
> Please note the use of the _-Dio.netty.noUnsafe=true_ property. Preventing off-heap 
> memory usage currently seems to be the only way to control the amount of memory 
> used for transferring shuffle data.
> Also note that the total number of cores allocated for the job is 5x2=10.
> # Then generate some test data
> {code:scala}
> import org.apache.hadoop.io._ 
> import org.apache.hadoop.io.compress._ 
> import org.apache.commons.lang._ 
> import org.apache.spark._ 
> // generate 100M records of sample data 
> sc.makeRDD(1 to 1000, 1000) 
>   .flatMap(item => (1 to 10) 
>     .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> 
>       new Text(RandomStringUtils.randomAlphanumeric(1024)))) 
>   .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) 
> {code}
> # Run the sample job
> {code:scala}
> import org.apache.hadoop.io._
> import org.apache.spark._
> import org.apache.spark.storage._
> val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
> rdd 
>   .map(item => item._1.toString -> item._2.toString) 
>   .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) 
>   .coalesce(10,false) 
>   .count 
> {code}
> Note that the number of partitions is equal to the total number of cores 
> allocated to the job.






[jira] [Updated] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions

2018-11-19 Thread Sergey Zhemzhitsky (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Zhemzhitsky updated SPARK-26114:
---
Description: 
Trying to use _coalesce_ after shuffle-oriented transformations leads to 
OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X 
GB of Y GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead_.
The discussion is 
[here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html].

The error happens when trying to specify a pretty small number of partitions 
in the _coalesce_ call.

*How to reproduce?*

# Start spark-shell
{code:bash}
spark-shell \ 
  --num-executors=5 \ 
  --executor-cores=2 \ 
  --master=yarn \
  --deploy-mode=client \ 
  --conf spark.executor.memory=1g \ 
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
{code}
Please note the use of the _-Dio.netty.noUnsafe=true_ property. Preventing off-heap 
memory usage currently seems to be the only way to control the amount of memory 
used for transferring shuffle data.
Also note that the total number of cores allocated for the job is 5x2=10.
# Then generate some test data
{code:scala}
import org.apache.hadoop.io._ 
import org.apache.hadoop.io.compress._ 
import org.apache.commons.lang._ 
import org.apache.spark._ 

// generate 100M records of sample data 
sc.makeRDD(1 to 1000, 1000) 
  .flatMap(item => (1 to 10) 
    .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> 
      new Text(RandomStringUtils.randomAlphanumeric(1024)))) 
  .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) 
{code}
# Run the sample job
{code:scala}
import org.apache.hadoop.io._
import org.apache.spark._
import org.apache.spark.storage._

val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
rdd 
  .map(item => item._1.toString -> item._2.toString) 
  .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) 
  .coalesce(10,false) 
  .count 
{code}
Note that the number of partitions is equal to the total number of cores 
allocated to the job.


  was:
Trying to use _coalesce_ after shuffle-oriented transformations leads to 
OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X 
GB of Y GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead_

The error happens when trying to specify a pretty small number of partitions 
in the _coalesce_ call.

*How to reproduce?*

# Start spark-shell
{code:bash}
spark-shell \ 
  --num-executors=5 \ 
  --executor-cores=2 \ 
  --master=yarn \
  --deploy-mode=client \ 
  --conf spark.executor.memory=1g \ 
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
{code}
Please note the use of the _-Dio.netty.noUnsafe=true_ property. Preventing off-heap 
memory usage currently seems to be the only way to control the amount of memory 
used for transferring shuffle data.
Also note that the total number of cores allocated for the job is 5x2=10.
# Then generate some test data
{code:scala}
import org.apache.hadoop.io._ 
import org.apache.hadoop.io.compress._ 
import org.apache.commons.lang._ 
import org.apache.spark._ 

// generate 100M records of sample data 
sc.makeRDD(1 to 1000, 1000) 
  .flatMap(item => (1 to 10) 
    .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> 
      new Text(RandomStringUtils.randomAlphanumeric(1024)))) 
  .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) 
{code}
# Run the sample job
{code:scala}
import org.apache.hadoop.io._
import org.apache.spark._
import org.apache.spark.storage._

val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
rdd 
  .map(item => item._1.toString -> item._2.toString) 
  .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) 
  .coalesce(10,false) 
  .count 
{code}
Note that the number of partitions is equal to the total number of cores 
allocated to the job.





> Memory leak of PartitionedPairBuffer when coalescing after 
> repartitionAndSortWithinPartitions
> -
>
> Key: SPARK-26114
> URL: https://issues.apache.org/jira/browse/SPARK-26114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
> Environment: Spark 3.0.0-SNAPSHOT (master branch)
> Scala 2.11
> Yarn 2.7
>Reporter: Sergey Zhemzhitsky
>Priority: Major
> Attachments: run1-noparams-dominator-tree-externalsorter-gc-root.png, 
> run1-noparams-dominator-tree-externalsorter.png, 
> run1-noparams-dominator-tree.png
>
>
> Trying to use _c

[jira] [Updated] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions

2018-11-19 Thread Sergey Zhemzhitsky (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Zhemzhitsky updated SPARK-26114:
---
Description: 
Trying to use _coalesce_ after shuffle-oriented transformations leads to 
OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X 
GB of Y GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead_.
The discussion is 
[here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html].

The error happens when trying to specify a pretty small number of partitions 
in the _coalesce_ call.

*How to reproduce?*

# Start spark-shell
{code:bash}
spark-shell \ 
  --num-executors=5 \ 
  --executor-cores=2 \ 
  --master=yarn \
  --deploy-mode=client \ 
  --conf spark.executor.memory=1g \ 
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
{code}
Please note the use of the _-Dio.netty.noUnsafe=true_ property. Preventing off-heap 
memory usage currently seems to be the only way to control the amount of memory 
used for transferring shuffle data.
Also note that the total number of cores allocated for the job is 5x2=10.
# Then generate some test data
{code:scala}
import org.apache.hadoop.io._ 
import org.apache.hadoop.io.compress._ 
import org.apache.commons.lang._ 
import org.apache.spark._ 

// generate 100M records of sample data 
sc.makeRDD(1 to 1000, 1000) 
  .flatMap(item => (1 to 10) 
    .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> 
      new Text(RandomStringUtils.randomAlphanumeric(1024)))) 
  .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) 
{code}
# Run the sample job
{code:scala}
import org.apache.hadoop.io._
import org.apache.spark._
import org.apache.spark.storage._

val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
rdd 
  .map(item => item._1.toString -> item._2.toString) 
  .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) 
  .coalesce(10,false) 
  .count 
{code}
Note that the number of partitions is equal to the total number of cores 
allocated to the job.

Here is the dominator tree from the heap dump:
 !run1-noparams-dominator-tree.png|width=500!
 

  was:
Trying to use _coalesce_ after shuffle-oriented transformations leads to 
OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X 
GB of Y GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead_.
The discussion is 
[here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html].

The error happens when trying to specify a pretty small number of partitions 
in the _coalesce_ call.

*How to reproduce?*

# Start spark-shell
{code:bash}
spark-shell \ 
  --num-executors=5 \ 
  --executor-cores=2 \ 
  --master=yarn \
  --deploy-mode=client \ 
  --conf spark.executor.memory=1g \ 
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
{code}
Please note the use of the _-Dio.netty.noUnsafe=true_ property. Preventing off-heap 
memory usage currently seems to be the only way to control the amount of memory 
used for transferring shuffle data.
Also note that the total number of cores allocated for the job is 5x2=10.
# Then generate some test data
{code:scala}
import org.apache.hadoop.io._ 
import org.apache.hadoop.io.compress._ 
import org.apache.commons.lang._ 
import org.apache.spark._ 

// generate 100M records of sample data 
sc.makeRDD(1 to 1000, 1000) 
  .flatMap(item => (1 to 10) 
    .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> 
      new Text(RandomStringUtils.randomAlphanumeric(1024)))) 
  .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) 
{code}
# Run the sample job
{code:scala}
import org.apache.hadoop.io._
import org.apache.spark._
import org.apache.spark.storage._

val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
rdd 
  .map(item => item._1.toString -> item._2.toString) 
  .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) 
  .coalesce(10,false) 
  .count 
{code}
Note that the number of partitions is equal to the total number of cores 
allocated to the job.



> Memory leak of PartitionedPairBuffer when coalescing after 
> repartitionAndSortWithinPartitions
> -
>
> Key: SPARK-26114
> URL: https://issues.apache.org/jira/browse/SPARK-26114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
> Environment: Spark 3.0.0-SNAPSHOT (master branch)
> Scala 2.11
> Yarn 2.7
>Reporter: Sergey Zhemzhitsky
>Prior

[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Sai Varun Reddy Daram (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691462#comment-16691462
 ] 

Sai Varun Reddy Daram commented on SPARK-26019:
---

Any help with https://issues.apache.org/jira/browse/SPARK-26113?

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> The error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a DataFrame and running a count on it.
> The error seems to be flaky - on the next rerun it didn't happen.
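Since the report elides the exact transformation, the block below is a hypothetical reconstruction of a pipeline of that shape; the path, record length, and parsing lambda are placeholders rather than values from the report:
{code:python}
# Hypothetical reconstruction of the reported pipeline shape; the path, record
# length and parsing lambda are placeholders, not values from the report.
full_file_path = "/data/records.bin"
record_length = 128

binary_rdd = sc.binaryRecords(full_file_path, record_length) \
               .map(lambda rec: (len(rec), bytes(rec[:4])))

df = binary_rdd.toDF(["length", "header"])
df.count()  # the action that intermittently triggers the accumulator error above
{code}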






[jira] [Updated] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions

2018-11-19 Thread Sergey Zhemzhitsky (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Zhemzhitsky updated SPARK-26114:
---
Description: 
Trying to use _coalesce_ after shuffle-oriented transformations leads to 
OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X 
GB of Y GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead_.
The discussion is 
[here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html].

The error happens when trying to specify a pretty small number of partitions 
in the _coalesce_ call.

*How to reproduce?*

# Start spark-shell
{code:bash}
spark-shell \ 
  --num-executors=5 \ 
  --executor-cores=2 \ 
  --master=yarn \
  --deploy-mode=client \ 
  --conf spark.executor.memory=1g \ 
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
{code}
Please note the use of the _-Dio.netty.noUnsafe=true_ property. Preventing off-heap 
memory usage currently seems to be the only way to control the amount of memory 
used for transferring shuffle data.
Also note that the total number of cores allocated for the job is 5x2=10.
# Then generate some test data
{code:scala}
import org.apache.hadoop.io._ 
import org.apache.hadoop.io.compress._ 
import org.apache.commons.lang._ 
import org.apache.spark._ 

// generate 100M records of sample data 
sc.makeRDD(1 to 1000, 1000) 
  .flatMap(item => (1 to 10) 
    .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> 
      new Text(RandomStringUtils.randomAlphanumeric(1024)))) 
  .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) 
{code}
# Run the sample job
{code:scala}
import org.apache.hadoop.io._
import org.apache.spark._
import org.apache.spark.storage._

val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
rdd 
  .map(item => item._1.toString -> item._2.toString) 
  .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) 
  .coalesce(10,false) 
  .count 
{code}
Note that the number of partitions is equal to the total number of cores 
allocated to the job.

Here is the dominator tree from the heap dump:
 !run1-noparams-dominator-tree.png|width=700!

There are 4 instances of ExternalSorter, although there are only 2 concurrently 
running tasks per executor:
 !run1-noparams-dominator-tree-externalsorter.png|width=700! 

And the paths to the GC root of the already stopped ExternalSorter:
 !run1-noparams-dominator-tree-externalsorter-gc-root.png|width=700! 

  was:
Trying to use _coalesce_ after shuffle-oriented transformations leads to 
OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X 
GB of Y GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead_.
The discussion is 
[here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html].

The error happens when trying to specify a pretty small number of partitions 
in the _coalesce_ call.

*How to reproduce?*

# Start spark-shell
{code:bash}
spark-shell \ 
  --num-executors=5 \ 
  --executor-cores=2 \ 
  --master=yarn \
  --deploy-mode=client \ 
  --conf spark.executor.memory=1g \ 
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
{code}
Please note the use of the _-Dio.netty.noUnsafe=true_ property. Preventing off-heap 
memory usage currently seems to be the only way to control the amount of memory 
used for transferring shuffle data.
Also note that the total number of cores allocated for the job is 5x2=10.
# Then generate some test data
{code:scala}
import org.apache.hadoop.io._ 
import org.apache.hadoop.io.compress._ 
import org.apache.commons.lang._ 
import org.apache.spark._ 

// generate 100M records of sample data 
sc.makeRDD(1 to 1000, 1000) 
  .flatMap(item => (1 to 10) 
    .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> 
      new Text(RandomStringUtils.randomAlphanumeric(1024)))) 
  .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) 
{code}
# Run the sample job
{code:scala}
import org.apache.hadoop.io._
import org.apache.spark._
import org.apache.spark.storage._

val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
rdd 
  .map(item => item._1.toString -> item._2.toString) 
  .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) 
  .coalesce(10,false) 
  .count 
{code}
Note that the number of partitions is equal to the total number of cores 
allocated to the job.

Here is the dominator tree from the heap dump:
 !run1-noparams-dominator-tree.png|width=500!
 


> Memory leak of PartitionedPairBuffer when coalescing after 
> repartitionAndSortWithinPartitions
> -
>
>

[jira] [Commented] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory

2018-11-19 Thread Dave DeCaprio (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691519#comment-16691519
 ] 

Dave DeCaprio commented on SPARK-25380:
---

I would like to comment that we are also seeing this.  200Mb plans are not 
unusual for us.

> Generated plans occupy over 50% of Spark driver memory
> --
>
> Key: SPARK-25380
> URL: https://issues.apache.org/jira/browse/SPARK-25380
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment: Spark 2.3.1 (AWS emr-5.16.0)
>  
>Reporter: Michael Spector
>Priority: Minor
> Attachments: Screen Shot 2018-09-06 at 23.19.56.png, Screen Shot 
> 2018-09-12 at 8.20.05.png, heapdump_OOM.png, image-2018-09-16-14-21-38-939.png
>
>
> When debugging an OOM exception during a long run of a Spark application (many 
> iterations of the same code) I've found that generated plans occupy most of 
> the driver memory. I'm not sure whether this is a memory leak or not, but it 
> would be helpful if old plans could be purged from memory anyway.
> Attached are screenshots of OOM heap dump opened in JVisualVM.
>  






[jira] [Resolved] (SPARK-26113) TypeError: object of type 'NoneType' has no len() in authenticate_and_accum_updates of pyspark/accumulators.py

2018-11-19 Thread Sai Varun Reddy Daram (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sai Varun Reddy Daram resolved SPARK-26113.
---
Resolution: Invalid

> TypeError: object of type 'NoneType' has no len() in 
> authenticate_and_accum_updates of pyspark/accumulators.py
> --
>
> Key: SPARK-26113
> URL: https://issues.apache.org/jira/browse/SPARK-26113
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.0
>Reporter: Sai Varun Reddy Daram
>Priority: Major
>
> Machine OS: Ubuntu 16.04.
> Kubernetes: Minikube 
> Kubernetes Version: 1.10.0
> Spark Kubernetes Image: pyspark ( at docker hub: saivarunr/spark-py:2.4 ) 
> built using standard spark docker build.sh file.
> Driver is inside pod in kubernetes cluster.
> Steps to replicate:
> 1) Create a spark Session:  
> {code:java}
> // 
> spark_session=SparkSession.builder.master('k8s://https://192.168.99.100:8443').config('spark.executor.instances','1').config('spark.kubernetes.container.image','saivarunr/spark-py:2.4').getOrCreate()
> {code}
>  2) Create a sample DataFrame
> {code:java}
> // df=spark_session.createDataFrame([{'a':1}])
> {code}
>  3) Do some operation on this dataframe
> {code:java}
> // df.count(){code}
> I get this output.
> {code:java}
> // Exception happened during processing of request from ('127.0.0.1', 38690)
> Traceback (most recent call last):
> File "/usr/lib/python3.6/socketserver.py", line 317, in 
> _handle_request_noblock
> self.process_request(request, client_address)
> File "/usr/lib/python3.6/socketserver.py", line 348, in process_request
> self.finish_request(request, client_address)
> File "/usr/lib/python3.6/socketserver.py", line 361, in finish_request
> self.RequestHandlerClass(request, client_address, self)
> File "/usr/lib/python3.6/socketserver.py", line 721, in __init__
> self.handle()
> File 
> "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", 
> line 266, in handle
> poll(authenticate_and_accum_updates)
> File 
> "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", 
> line 241, in poll
> if func():
> File 
> "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", 
> line 254, in authenticate_and_accum_updates
> received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
> {code}
> 4) Repeat the above step; the error does not appear again.
>  
> But now close the session, kill the Python terminal or process, and try 
> again: the same error happens.
>  
> Is this related to https://issues.apache.org/jira/browse/SPARK-26019?






[jira] [Assigned] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26114:


Assignee: Apache Spark

> Memory leak of PartitionedPairBuffer when coalescing after 
> repartitionAndSortWithinPartitions
> -
>
> Key: SPARK-26114
> URL: https://issues.apache.org/jira/browse/SPARK-26114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
> Environment: Spark 3.0.0-SNAPSHOT (master branch)
> Scala 2.11
> Yarn 2.7
>Reporter: Sergey Zhemzhitsky
>Assignee: Apache Spark
>Priority: Major
> Attachments: run1-noparams-dominator-tree-externalsorter-gc-root.png, 
> run1-noparams-dominator-tree-externalsorter.png, 
> run1-noparams-dominator-tree.png
>
>
> Trying to use _coalesce_ after shuffle-oriented transformations leads to 
> OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X 
> GB of Y GB physical memory used. Consider boosting 
> spark.yarn.executor.memoryOverhead_.
> Discussion is 
> [here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html].
> The error happens when trying to specify a pretty small number of partitions in 
> the _coalesce_ call.
> *How to reproduce?*
> # Start spark-shell
> {code:bash}
> spark-shell \ 
>   --num-executors=5 \ 
>   --executor-cores=2 \ 
>   --master=yarn \
>   --deploy-mode=client \ 
>   --conf spark.executor.memory=1g \ 
>   --conf spark.dynamicAllocation.enabled=false \
>   --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError 
> -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
> {code}
> Please note the use of the _-Dio.netty.noUnsafe=true_ property. Preventing off-heap 
> memory usage seems to be the only way to control the amount of memory used 
> for transferring shuffle data for now.
> Also note that the total number of cores allocated for the job is 5x2=10.
> # Then generate some test data
> {code:scala}
> import org.apache.hadoop.io._ 
> import org.apache.hadoop.io.compress._ 
> import org.apache.commons.lang._ 
> import org.apache.spark._ 
> // generate 100M records of sample data 
> sc.makeRDD(1 to 1000, 1000) 
>   .flatMap(item => (1 to 10) 
> .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) 
> -> new Text(RandomStringUtils.randomAlphanumeric(1024))))
>   .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) 
> {code}
> # Run the sample job
> {code:scala}
> import org.apache.hadoop.io._
> import org.apache.spark._
> import org.apache.spark.storage._
> val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
> rdd 
>   .map(item => item._1.toString -> item._2.toString) 
>   .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) 
>   .coalesce(10,false) 
>   .count 
> {code}
> Note that the number of partitions is equal to the total number of cores 
> allocated to the job.
> Here is dominator tree from the heapdump
>  !run1-noparams-dominator-tree.png|width=700!
> 4 instances of ExternalSorter, although there are only 2 concurrently running 
> tasks per executor.
>  !run1-noparams-dominator-tree-externalsorter.png|width=700! 
> And paths to GC root of the already stopped ExternalSorter.
>  !run1-noparams-dominator-tree-externalsorter-gc-root.png|width=700! 
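
As a side note, below is a rough PySpark sketch of the same job shape for anyone who wants to try it outside spark-shell; the input path, the key derivation and the coalesce(..., shuffle=True) mitigation at the end are assumptions for illustration, not something verified in this ticket.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-after-sort-repro").getOrCreate()
sc = spark.sparkContext

# (short key, payload) pairs; the path and the key derivation are placeholders
pairs = (
    sc.textFile("/tmp/random-strings")
      .map(lambda line: (line[:3], line))
)

# The pattern from the report: coalesce (shuffle=False by default) makes one
# task consume several sorted parent partitions, keeping several sorters alive.
pairs.repartitionAndSortWithinPartitions(numPartitions=1000).coalesce(10).count()

# Possible mitigation (an assumption, not verified here): shuffle=True turns
# coalesce into a full repartition, so each task reads a single shuffled
# partition instead of holding several sorted ones at once.
pairs.repartitionAndSortWithinPartitions(numPartitions=1000).coalesce(10, shuffle=True).count()
{code}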



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691575#comment-16691575
 ] 

Apache Spark commented on SPARK-26114:
--

User 'szhem' has created a pull request for this issue:
https://github.com/apache/spark/pull/23083

> Memory leak of PartitionedPairBuffer when coalescing after 
> repartitionAndSortWithinPartitions
> -
>
> Key: SPARK-26114
> URL: https://issues.apache.org/jira/browse/SPARK-26114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
> Environment: Spark 3.0.0-SNAPSHOT (master branch)
> Scala 2.11
> Yarn 2.7
>Reporter: Sergey Zhemzhitsky
>Priority: Major
> Attachments: run1-noparams-dominator-tree-externalsorter-gc-root.png, 
> run1-noparams-dominator-tree-externalsorter.png, 
> run1-noparams-dominator-tree.png
>
>
> Trying to use _coalesce_ after shuffle-oriented transformations leads to 
> OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X 
> GB of Y GB physical memory used. Consider boosting 
> spark.yarn.executor.memoryOverhead_.
> Discussion is 
> [here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html].
> The error happens when trying to specify a pretty small number of partitions in 
> the _coalesce_ call.
> *How to reproduce?*
> # Start spark-shell
> {code:bash}
> spark-shell \ 
>   --num-executors=5 \ 
>   --executor-cores=2 \ 
>   --master=yarn \
>   --deploy-mode=client \ 
>   --conf spark.executor.memory=1g \ 
>   --conf spark.dynamicAllocation.enabled=false \
>   --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError 
> -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
> {code}
> Please note the use of the _-Dio.netty.noUnsafe=true_ property. Preventing off-heap 
> memory usage seems to be the only way to control the amount of memory used 
> for transferring shuffle data for now.
> Also note that the total number of cores allocated for the job is 5x2=10.
> # Then generate some test data
> {code:scala}
> import org.apache.hadoop.io._ 
> import org.apache.hadoop.io.compress._ 
> import org.apache.commons.lang._ 
> import org.apache.spark._ 
> // generate 100M records of sample data 
> sc.makeRDD(1 to 1000, 1000) 
>   .flatMap(item => (1 to 10) 
> .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) 
> -> new Text(RandomStringUtils.randomAlphanumeric(1024))))
>   .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) 
> {code}
> # Run the sample job
> {code:scala}
> import org.apache.hadoop.io._
> import org.apache.spark._
> import org.apache.spark.storage._
> val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
> rdd 
>   .map(item => item._1.toString -> item._2.toString) 
>   .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) 
>   .coalesce(10,false) 
>   .count 
> {code}
> Note that the number of partitions is equal to the total number of cores 
> allocated to the job.
> Here is dominator tree from the heapdump
>  !run1-noparams-dominator-tree.png|width=700!
> 4 instances of ExternalSorter, although there are only 2 concurrently running 
> tasks per executor.
>  !run1-noparams-dominator-tree-externalsorter.png|width=700! 
> And paths to GC root of the already stopped ExternalSorter.
>  !run1-noparams-dominator-tree-externalsorter-gc-root.png|width=700! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26114:


Assignee: (was: Apache Spark)

> Memory leak of PartitionedPairBuffer when coalescing after 
> repartitionAndSortWithinPartitions
> -
>
> Key: SPARK-26114
> URL: https://issues.apache.org/jira/browse/SPARK-26114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
> Environment: Spark 3.0.0-SNAPSHOT (master branch)
> Scala 2.11
> Yarn 2.7
>Reporter: Sergey Zhemzhitsky
>Priority: Major
> Attachments: run1-noparams-dominator-tree-externalsorter-gc-root.png, 
> run1-noparams-dominator-tree-externalsorter.png, 
> run1-noparams-dominator-tree.png
>
>
> Trying to use _coalesce_ after shuffle-oriented transformations leads to 
> OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X 
> GB of Y GB physical memory used. Consider boosting 
> spark.yarn.executor.memoryOverhead_.
> Discussion is 
> [here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html].
> The error happens when trying to specify a pretty small number of partitions in 
> the _coalesce_ call.
> *How to reproduce?*
> # Start spark-shell
> {code:bash}
> spark-shell \ 
>   --num-executors=5 \ 
>   --executor-cores=2 \ 
>   --master=yarn \
>   --deploy-mode=client \ 
>   --conf spark.executor.memory=1g \ 
>   --conf spark.dynamicAllocation.enabled=false \
>   --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError 
> -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
> {code}
> Please note the use of the _-Dio.netty.noUnsafe=true_ property. Preventing off-heap 
> memory usage seems to be the only way to control the amount of memory used 
> for transferring shuffle data for now.
> Also note that the total number of cores allocated for the job is 5x2=10.
> # Then generate some test data
> {code:scala}
> import org.apache.hadoop.io._ 
> import org.apache.hadoop.io.compress._ 
> import org.apache.commons.lang._ 
> import org.apache.spark._ 
> // generate 100M records of sample data 
> sc.makeRDD(1 to 1000, 1000) 
>   .flatMap(item => (1 to 10) 
> .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) 
> -> new Text(RandomStringUtils.randomAlphanumeric(1024))))
>   .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) 
> {code}
> # Run the sample job
> {code:scala}
> import org.apache.hadoop.io._
> import org.apache.spark._
> import org.apache.spark.storage._
> val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
> rdd 
>   .map(item => item._1.toString -> item._2.toString) 
>   .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) 
>   .coalesce(10,false) 
>   .count 
> {code}
> Note that the number of partitions is equal to the total number of cores 
> allocated to the job.
> Here is dominator tree from the heapdump
>  !run1-noparams-dominator-tree.png|width=700!
> 4 instances of ExternalSorter, although there are only 2 concurrently running 
> tasks per executor.
>  !run1-noparams-dominator-tree-externalsorter.png|width=700! 
> And paths to GC root of the already stopped ExternalSorter.
>  !run1-noparams-dominator-tree-externalsorter-gc-root.png|width=700! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26115) Fix deprecated warnings related to scala 2.12 in core module

2018-11-19 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-26115:
--

 Summary: Fix deprecated warnings related to scala 2.12 in core 
module
 Key: SPARK-26115
 URL: https://issues.apache.org/jira/browse/SPARK-26115
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Gengliang Wang


{quote}[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:505:
 Eta-expansion of zero-argument methods is deprecated. To avoid this warning, 
write (() => SparkContext.this.reportHeartBeat()).
[warn] _heartbeater = new Heartbeater(env.memoryManager, reportHeartBeat, 
"driver-heartbeater",
[warn]   ^
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala:273:
 Eta-expansion of zero-argument methods is deprecated. To avoid this warning, 
write (() => HadoopDelegationTokenManager.this.fileSystemsToAccess()).
[warn] val providers = Seq(new 
HadoopFSDelegationTokenProvider(fileSystemsToAccess)) ++
[warn] ^
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/executor/Executor.scala:193:
 Eta-expansion of zero-argument methods is deprecated. To avoid this warning, 
write (() => Executor.this.reportHeartBeat()).
[warn]   private val heartbeater = new Heartbeater(env.memoryManager, 
reportHeartBeat,
[warn]^
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/api/r/RBackend.scala:102:
 method childGroup in class ServerBootstrap is deprecated: see corresponding 
Javadoc for more information.
[warn] if (bootstrap != null && bootstrap.childGroup() != null) {
[warn]^
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala:63:
 type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use 
java.util.concurrent.ForkJoinPool directly, instead of this alias
[warn] new ForkJoinTaskSupport(new ForkJoinPool(8))
[warn] ^
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/scheduler/StageInfo.scala:59:
 value attemptId in class StageInfo is deprecated (since 2.3.0): Use 
attemptNumber instead
[warn]   def attemptNumber(): Int = attemptId
[warn]  ^
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:186:
 type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use 
java.util.concurrent.ForkJoinPool directly, instead of this alias
[warn]   def newForkJoinPool(prefix: String, maxThreadNumber: Int): 
SForkJoinPool = {
[warn]  ^
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:189:
 type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use 
java.util.concurrent.ForkJoinPool directly, instead of this alias
[warn]   override def newThread(pool: SForkJoinPool) =
[warn]^
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:190:
 type ForkJoinWorkerThread in package forkjoin is deprecated (since 2.12.0): 
use java.util.concurrent.ForkJoinWorkerThread directly, instead of this alias
[warn] new SForkJoinWorkerThread(pool) {
[warn] ^
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:194:
 type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use 
java.util.concurrent.ForkJoinPool directly, instead of this alias
[warn] new SForkJoinPool(maxThreadNumber, factory,
[warn] ^
[warn] 10 warnings found
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:505:
 Eta-expansion of zero-argument methods is deprecated. To avoid this warning, 
write (() => SparkContext.this.reportHeartBeat()).
[warn] _heartbeater = new Heartbeater(env.memoryManager, reportHeartBeat, 
"driver-heartbeater",
[warn] 
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala:273:
 Eta-expansion of zero-argument methods is deprecated. To avoid this warning, 
write (() => HadoopDelegationTokenManager.this.fileSystemsToAccess()).
[warn] val providers = Seq(new 
HadoopFSDelegationTokenProvider(fileSystemsToAccess)) ++
[warn] 
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/api/r/RBackend.scala:102:
 method childGroup in class S

[jira] [Resolved] (SPARK-26115) Fix deprecated warnings related to scala 2.12 in core module

2018-11-19 Thread Gengliang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-26115.

Resolution: Duplicate

Duplicated to SPARK-2609

> Fix deprecated warnings related to scala 2.12 in core module
> 
>
> Key: SPARK-26115
> URL: https://issues.apache.org/jira/browse/SPARK-26115
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> {quote}[warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:505:
>  Eta-expansion of zero-argument methods is deprecated. To avoid this warning, 
> write (() => SparkContext.this.reportHeartBeat()).
> [warn] _heartbeater = new Heartbeater(env.memoryManager, reportHeartBeat, 
> "driver-heartbeater",
> [warn]   ^
> [warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala:273:
>  Eta-expansion of zero-argument methods is deprecated. To avoid this warning, 
> write (() => HadoopDelegationTokenManager.this.fileSystemsToAccess()).
> [warn] val providers = Seq(new 
> HadoopFSDelegationTokenProvider(fileSystemsToAccess)) ++
> [warn] ^
> [warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/executor/Executor.scala:193:
>  Eta-expansion of zero-argument methods is deprecated. To avoid this warning, 
> write (() => Executor.this.reportHeartBeat()).
> [warn]   private val heartbeater = new Heartbeater(env.memoryManager, 
> reportHeartBeat,
> [warn]^
> [warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/api/r/RBackend.scala:102:
>  method childGroup in class ServerBootstrap is deprecated: see corresponding 
> Javadoc for more information.
> [warn] if (bootstrap != null && bootstrap.childGroup() != null) {
> [warn]^
> [warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala:63:
>  type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use 
> java.util.concurrent.ForkJoinPool directly, instead of this alias
> [warn] new ForkJoinTaskSupport(new ForkJoinPool(8))
> [warn] ^
> [warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/scheduler/StageInfo.scala:59:
>  value attemptId in class StageInfo is deprecated (since 2.3.0): Use 
> attemptNumber instead
> [warn]   def attemptNumber(): Int = attemptId
> [warn]  ^
> [warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:186:
>  type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use 
> java.util.concurrent.ForkJoinPool directly, instead of this alias
> [warn]   def newForkJoinPool(prefix: String, maxThreadNumber: Int): 
> SForkJoinPool = {
> [warn]  ^
> [warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:189:
>  type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use 
> java.util.concurrent.ForkJoinPool directly, instead of this alias
> [warn]   override def newThread(pool: SForkJoinPool) =
> [warn]^
> [warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:190:
>  type ForkJoinWorkerThread in package forkjoin is deprecated (since 2.12.0): 
> use java.util.concurrent.ForkJoinWorkerThread directly, instead of this alias
> [warn] new SForkJoinWorkerThread(pool) {
> [warn] ^
> [warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:194:
>  type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use 
> java.util.concurrent.ForkJoinPool directly, instead of this alias
> [warn] new SForkJoinPool(maxThreadNumber, factory,
> [warn] ^
> [warn] 10 warnings found
> [warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:505:
>  Eta-expansion of zero-argument methods is deprecated. To avoid this warning, 
> write (() => SparkContext.this.reportHeartBeat()).
> [warn] _heartbeater = new Heartbeater(env.memoryManager, reportHeartBeat, 
> "driver-heartbeater",
> [warn] 
> [warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenM

[jira] [Created] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors

2018-11-19 Thread Pierre Lienhart (JIRA)
Pierre Lienhart created SPARK-26116:
---

 Summary: Spark SQL - Sort when writing partitioned parquet lead to 
OOM errors
 Key: SPARK-26116
 URL: https://issues.apache.org/jira/browse/SPARK-26116
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1
Reporter: Pierre Lienhart


When writing partitioned parquet using ```partitionBy```, it looks like Spark 
sorts each partition before writing but this sort consumes a huge amount of 
memory compared to the size of the data. The executors can then go OOM and get 
killed by YARN. As a consequence, it also forces provisioning a huge amount of 
memory compared to the data to be written.

Error messages found in the Spark UI are like the following :

```

Spark UI description of failure : Job aborted due to stage failure: Task 169 in 
stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 
(TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 
1 exited caused by one of the running tasks) Reason: Container killed by YARN 
for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider 
boosting spark.yarn.executor.memoryOverhead.

```

```

Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
executor 1): org.apache.spark.SparkException: Task failed while writing rows

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)

 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

 at org.apache.spark.scheduler.Task.run(Task.scala:99)

 at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)

 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

 at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
/app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
 (No such file or directory)

 at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)

 at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)

 at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)

 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)

 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)

 ... 8 more

 ```

In the stderr logs, we can see that a huge amount of sort data (the partition 
being sorted here is 250 MB when persisted into memory, deserialized) is being 
spilled to the disk (```INFO UnsafeExternalSorter: Thread 155 spilling sort 
data of 3.6 GB to disk```). Sometimes the data is spilled in time to the disk 
and the sort completes (```INFO FileFormatWriter: Sorting complete. Writing out 
partition files one at a time.```) but sometimes it does not and we see 
multiple ```TaskMemoryManager: Failed to allocate a page (67108864 bytes), try 
again.``` until the application finally runs OOM with logs such as ```ERROR 
UnsafeExternalSorter: Unable to grow the pointer array```.

I should mention t

[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors

2018-11-19 Thread Pierre Lienhart (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Lienhart updated SPARK-26116:

Description: 
When writing partitioned parquet using {code:java}partitionBy{code}, it looks 
like Spark sorts each partition before writing but this sort consumes a huge 
amount of memory compared to the size of the data. The executors can then go 
OOM and get killed by YARN. As a consequence, it also forces provisioning a huge 
amount of memory compared to the data to be written.

Error messages found in the Spark UI are like the following :

 
{code:java}
Spark UI description of failure : Job aborted due to stage failure: Task 169 in 
stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 
(TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 
1 exited caused by one of the running tasks) Reason: Container killed by YARN 
for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider 
boosting spark.yarn.executor.memoryOverhead.
{code}
 
{code:java}
Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
executor 1): org.apache.spark.SparkException: Task failed while writing rows

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)

 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

 at org.apache.spark.scheduler.Task.run(Task.scala:99)

 at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)

 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

 at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
/app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
 (No such file or directory)

 at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)

 at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)

 at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)

 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)

 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)

 ... 8 more{code}
 

 

In the stderr logs, we can see that a huge amount of sort data (the partition 
being sorted here is 250 MB when persisted into memory, deserialized) is being 
spilled to the disk ({code:java}INFO UnsafeExternalSorter: Thread 155 spilling 
sort data of 3.6 GB to disk{code}). Sometimes the data is spilled in time to 
the disk and the sort completes ({code:java}INFO FileFormatWriter: Sorting 
complete. Writing out partition files one at a time.{code}) but sometimes it 
does not and we see multiple {code:java}TaskMemoryManager: Failed to allocate a 
page (67108864 bytes), try again.{code} until the application finally runs OOM 
with logs such as {code:java}ERROR UnsafeExternalSorter: Unable to grow the 
pointer array{code}.

I should mention that when looking at individual (successful) write tasks in 
the Spark UI, the Peak Execution Memory metric is always 0.  
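
For context, here is a small PySpark sketch of the write pattern described above, plus one commonly suggested mitigation; the column name, paths and sizes are placeholders (assumptions), not taken from this ticket.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-parquet-write").getOrCreate()

df = spark.range(0, 1000000).selectExpr(
    "id",
    "cast(id % 100 as string) as bucket",  # hypothetical partition column
)

# The pattern from the report: each write task sorts its rows by the partition
# column(s) before writing, which is where the large sort buffers come from.
df.write.partitionBy("bucket").mode("overwrite").parquet("/tmp/parquet_plain")

# Possible mitigation (an assumption, not verified here): shuffle by the
# partition column first so each task writes only a few partition values and
# the per-task sort stays small.
(
    df.repartition("bucket")
      .write.partitionBy("bucket")
      .mode("overwrite")
      .parquet("/tmp/parquet_repartitioned")
)
{code}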

[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors

2018-11-19 Thread Pierre Lienhart (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Lienhart updated SPARK-26116:

Description: 
When writing partitioned parquet using {{partitionBy}}, it looks like Spark 
sorts each partition before writing but this sort consumes a huge amount of 
memory compared to the size of the data. The executors can then go OOM and get 
killed by YARN. As a consequence, it also forces provisioning a huge amount of 
memory compared to the data to be written.

Error messages found in the Spark UI are like the following :

 
{code:java}
Spark UI description of failure : Job aborted due to stage failure: Task 169 in 
stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 
(TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 
1 exited caused by one of the running tasks) Reason: Container killed by YARN 
for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider 
boosting spark.yarn.executor.memoryOverhead.
{code}
 
{code:java}
Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
executor 1): org.apache.spark.SparkException: Task failed while writing rows

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)

 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

 at org.apache.spark.scheduler.Task.run(Task.scala:99)

 at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)

 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

 at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
/app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
 (No such file or directory)

 at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)

 at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)

 at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)

 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)

 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)

 ... 8 more{code}
 

 

In the stderr logs, we can see that a huge amount of sort data (the partition 
being sorted here is 250 MB when persisted into memory, deserialized) is being 
spilled to the disk ({code:java}INFO UnsafeExternalSorter: Thread 155 spilling 
sort data of 3.6 GB to disk{code}). Sometimes the data is spilled in time to 
the disk and the sort completes ({code:java}INFO FileFormatWriter: Sorting 
complete. Writing out partition files one at a time.{code}) but sometimes it 
does not and we see multiple {code:java}TaskMemoryManager: Failed to allocate a 
page (67108864 bytes), try again.{code} until the application finally runs OOM 
with logs such as {code:java}ERROR UnsafeExternalSorter: Unable to grow the 
pointer array{code}.

I should mention that when looking at individual (successful) write tasks in 
the Spark UI, the Peak Execution Memory metric is always 0.  

It looks lik

[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors

2018-11-19 Thread Pierre Lienhart (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Lienhart updated SPARK-26116:

Description: 
When writing partitioned parquet using {{partitionBy}}, it looks like Spark 
sorts each partition before writing but this sort consumes a huge amount of 
memory compared to the size of the data. The executors can then go OOM and get 
killed by YARN. As a consequence, it also forces provisioning a huge amount of 
memory compared to the data to be written.

Error messages found in the Spark UI are like the following :

 
{code:java}
Spark UI description of failure : Job aborted due to stage failure: Task 169 in 
stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 
(TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 
1 exited caused by one of the running tasks) Reason: Container killed by YARN 
for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider 
boosting spark.yarn.executor.memoryOverhead.
{code}
 
{code:java}
Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
executor 1): org.apache.spark.SparkException: Task failed while writing rows

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)

 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

 at org.apache.spark.scheduler.Task.run(Task.scala:99)

 at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)

 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

 at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
/app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
 (No such file or directory)

 at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)

 at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)

 at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)

 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)

 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)

 ... 8 more{code}
 

 

In the stderr logs, we can see that a huge amount of sort data (the partition 
being sorted here is 250 MB when persisted into memory, deserialized) is being 
spilled to the disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data 
of 3.6 GB to disk}}). Sometimes the data is spilled in time to the disk and the 
sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out 
partition files one at a time.}}) but sometimes it does not and we see multiple 
{{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} 
until the application finally runs OOM with logs such as {{ERROR 
UnsafeExternalSorter: Unable to grow the pointer array}}.

I should mention that when looking at individual (successful) write tasks in 
the Spark UI, the Peak Execution Memory metric is always 0.  

It looks like a known issue : 
[Spark-12546|https://issues.apache

[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors

2018-11-19 Thread Pierre Lienhart (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Lienhart updated SPARK-26116:

Description: 
When writing partitioned parquet using {{partitionBy}}, it looks like Spark 
sorts each partition before writing but this sort consumes a huge amount of 
memory compared to the size of the data. The executors can then go OOM and get 
killed by YARN. As a consequence, it also forces provisioning a huge amount of 
memory compared to the data to be written.

Error messages found in the Spark UI are like the following :

 
{code:java}
Spark UI description of failure : Job aborted due to stage failure: Task 169 in 
stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 
(TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 
1 exited caused by one of the running tasks) Reason: Container killed by YARN 
for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider 
boosting spark.yarn.executor.memoryOverhead.
{code}
 
{code:java}
Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
executor 1): org.apache.spark.SparkException: Task failed while writing rows

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)

 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

 at org.apache.spark.scheduler.Task.run(Task.scala:99)

 at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)

 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

 at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
/app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
 (No such file or directory)

 at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)

 at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)

 at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)

 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)

 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)

 ... 8 more{code}
 

 

In the stderr logs, we can see that a huge amount of sort data (the partition 
being sorted here is 250 MB when persisted into memory, deserialized) is being 
spilled to the disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data 
of 3.6 GB to disk}}). Sometimes the data is spilled in time to the disk and the 
sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out 
partition files one at a time.}}) but sometimes it does not and we see multiple 
{{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} 
until the application finally runs OOM with logs such as {{ERROR 
UnsafeExternalSorter: Unable to grow the pointer array}}.

I should mention that when looking at individual (successful) write tasks in 
the Spark UI, the Peak Execution Memory metric is always 0.  

It looks like a known issue : SPARK-12546 is explicitly related a

[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors

2018-11-19 Thread Pierre Lienhart (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Lienhart updated SPARK-26116:

Description: 
When writing partitioned parquet using {{partitionBy}}, it looks like Spark 
sorts each partition before writing but this sort consumes a huge amount of 
memory compared to the size of the data. The executors can then go OOM and get 
killed by YARN. As a consequence, it also force to provision huge amount of 
memory compared to the data to be written.

Error messages found in the Spark UI are like the following :

 
{code:java}
Spark UI description of failure : Job aborted due to stage failure: Task 169 in 
stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 
(TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 
1 exited caused by one of the running tasks) Reason: Container killed by YARN 
for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider 
boosting spark.yarn.executor.memoryOverhead.
{code}
 
{code:java}
Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
executor 1): org.apache.spark.SparkException: Task failed while writing rows

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)

 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

 at org.apache.spark.scheduler.Task.run(Task.scala:99)

 at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)

 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

 at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
/app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
 (No such file or directory)

 at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)

 at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)

 at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)

 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)

 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)

 ... 8 more{code}
 

 

In the stderr logs, we can see that huge amount of sort data (the partition 
being sorted here is 250 MB when persisted into memory, deserialized) is being 
spilled to the disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data 
of 3.6 GB to disk}}). Sometimes the data is spilled in time to the disk and the 
sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out 
partition files one at a time.}}) but sometimes it does not and we see multiple 
{{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} 
until the application finally runs OOM with logs such as {{ERROR 
UnsafeExternalSorter: Unable to grow the pointer array}}.

I should mention that when looking at individual (successful) write tasks in 
the Spark UI, the Peak Execution Memory metric is always 0.  

It looks like a known issue : SPARK-12546 is explicitly related a

[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors

2018-11-19 Thread Pierre Lienhart (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Lienhart updated SPARK-26116:

Description: 
When writing partitioned parquet using {{partitionBy}}, it looks like Spark 
sorts each partition before writing but this sort consumes a huge amount of 
memory compared to the size of the data. The executors can then go OOM and get 
killed by YARN. As a consequence, it also force to provision huge amount of 
memory compared to the data to be written.

Error messages found in the Spark UI are like the following :

 
{code:java}
Spark UI description of failure : Job aborted due to stage failure: Task 169 in 
stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 
(TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 
1 exited caused by one of the running tasks) Reason: Container killed by YARN 
for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider 
boosting spark.yarn.executor.memoryOverhead.
{code}
 
{code:java}
Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
executor 1): org.apache.spark.SparkException: Task failed while writing rows

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)

 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

 at org.apache.spark.scheduler.Task.run(Task.scala:99)

 at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)

 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

 at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
/app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
 (No such file or directory)

 at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)

 at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)

 at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)

 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)

 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)

 ... 8 more{code}
 

 

In the stderr logs, we can see that huge amount of sort data (the partition 
being sorted here is 250 MB when persisted into memory, deserialized) is being 
spilled to the disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data 
of 3.6 GB to disk}}). Sometimes the data is spilled in time to the disk and the 
sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out 
partition files one at a time.}}) but sometimes it does not and we see multiple 
{{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} 
until the application finally runs OOM with logs such as {{ERROR 
UnsafeExternalSorter: Unable to grow the pointer array}}.

I should mention that when looking at individual (successful) write tasks in 
the Spark UI, the Peak Execution Memory metric is always 0.  

It looks like a known issue : SPARK-12546 is explicitly related a

[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors

2018-11-19 Thread Pierre Lienhart (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Lienhart updated SPARK-26116:

Description: 
When writing partitioned parquet using {{partitionBy}}, it looks like Spark 
sorts each partition before writing but this sort consumes a huge amount of 
memory compared to the size of the data. The executors can then go OOM and get 
killed by YARN. As a consequence, it also force to provision huge amount of 
memory compared to the data to be written.

Error messages found in the Spark UI are like the following :

 
{code:java}
Spark UI description of failure : Job aborted due to stage failure: Task 169 in 
stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 
(TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 
1 exited caused by one of the running tasks) Reason: Container killed by YARN 
for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider 
boosting spark.yarn.executor.memoryOverhead.
{code}
 
{code:java}
Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
executor 1): org.apache.spark.SparkException: Task failed while writing rows

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)

 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

 at org.apache.spark.scheduler.Task.run(Task.scala:99)

 at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)

 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

 at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
/app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
 (No such file or directory)

 at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)

 at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)

 at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)

 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)

 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)

 ... 8 more{code}
 

 

In the stderr logs, we can see that a huge amount of sort data (the partition 
being sorted here is 250 MB when persisted in memory, deserialized) is being 
spilled to disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data 
of 3.6 GB to disk}}). Sometimes the data is spilled to disk in time and the 
sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out 
partition files one at a time.}}), but sometimes it is not and we see repeated 
{{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} 
messages until the application finally runs OOM with logs such as {{ERROR 
UnsafeExternalSorter: Unable to grow the pointer array}}.

I should mention that when looking at individual (successful) write tasks in 
the Spark UI, the Peak Execution Memory metric is always 0.  
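As a minimal sketch of the code path involved (the {{events}} DataFrame and its 
{{day}} partition column below are hypothetical stand-ins, and the second variant 
is only a commonly suggested mitigation, not necessarily the fix for this ticket):

{code:scala}
import org.apache.spark.sql.functions.col

// Writing with partitionBy goes through the dynamic-partition write path seen in the
// stack trace above, which sorts each task's rows by the partition column before
// writing; that per-task sort is where the UnsafeExternalSorter memory is spent.
events.write
  .partitionBy("day")
  .parquet("/tmp/events_partitioned")

// Commonly suggested mitigation (sketch only): shuffle by the partition column first,
// so each write task mostly holds rows for a single output partition and the
// per-task sort stays small.
events
  .repartition(col("day"))
  .write
  .partitionBy("day")
  .parquet("/tmp/events_partitioned")
{code}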

It looks like a known issue: SPARK-12546 is explicitly related a

[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet leads to OOM errors

2018-11-19 Thread Pierre Lienhart (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Lienhart updated SPARK-26116:

Summary: Spark SQL - Sort when writing partitioned parquet leads to OOM 
errors  (was: Spark SQL - Sort when writing partitioned parquet lead to OOM 
errors)

> Spark SQL - Sort when writing partitioned parquet leads to OOM errors
> -
>
> Key: SPARK-26116
> URL: https://issues.apache.org/jira/browse/SPARK-26116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Pierre Lienhart
>Priority: Major
>
> When writing partitioned Parquet using {{partitionBy}}, it looks like Spark 
> sorts each partition before writing, but this sort consumes a huge amount of 
> memory compared to the size of the data. The executors can then go OOM and 
> get killed by YARN. As a consequence, it also forces us to provision a huge 
> amount of memory compared to the data being written.
> Error messages found in the Spark UI look like the following:
> {code:java}
> Spark UI description of failure : Job aborted due to stage failure: Task 169 
> in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 
> 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure 
> (executor 1 exited caused by one of the running tasks) Reason: Container 
> killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> {code}
>  
> {code:java}
> Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
> recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
> executor 1): org.apache.spark.SparkException: Task failed while writing rows
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:99)
>  at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
> /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
>  (No such file or directory)
>  at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)
>  at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)
>  at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)
>  at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)
>  at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)
>  at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
>  ... 8 more{code}
>  
> In the stderr logs, we can see that a huge amount of sort data (the partition 
> being sorted here is 250 MB when persisted in memory, deserialized) is 
> being spilled to disk ({{INFO UnsafeExternalSo

[jira] [Updated] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions

2018-11-19 Thread Sergey Zhemzhitsky (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Zhemzhitsky updated SPARK-26114:
---
Description: 
Trying to use _coalesce_ after shuffle-oriented transformations leads to 
OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X 
GB of Y GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead_.
The discussion is 
[here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html].

The error happens when trying to specify a pretty small number of partitions in 
the _coalesce_ call.

*How to reproduce?*

# Start spark-shell
{code:bash}
spark-shell \ 
  --num-executors=5 \ 
  --executor-cores=2 \ 
  --master=yarn \
  --deploy-mode=client \ 
  --conf spark.executor.memoryOverhead=512 \
  --conf spark.executor.memory=1g \ 
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
{code}
Please note the use of the _-Dio.netty.noUnsafe=true_ property. Preventing off-heap 
memory usage seems to be the only way to control the amount of memory used for 
shuffle data transfer for now.
Also note that the total number of cores allocated to the job is 5x2=10.
# Then generate some test data
{code:scala}
import org.apache.hadoop.io._ 
import org.apache.hadoop.io.compress._ 
import org.apache.commons.lang._ 
import org.apache.spark._ 

// generate 100M records of sample data 
sc.makeRDD(1 to 1000, 1000)
  .flatMap(item => (1 to 100000)
    .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) ->
      new Text(RandomStringUtils.randomAlphanumeric(1024))))
  .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec]))
{code}
# Run the sample job
{code:scala}
import org.apache.hadoop.io._
import org.apache.spark._
import org.apache.spark.storage._

val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
rdd 
  .map(item => item._1.toString -> item._2.toString) 
  .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) 
  .coalesce(10,false) 
  .count 
{code}
Note that the number of partitions is equal to the total number of cores 
allocated to the job.

Here is dominator tree from the heapdump
 !run1-noparams-dominator-tree.png|width=700!

4 instances of ExternalSorter, although there are only 2 concurrently running 
tasks per executor.
 !run1-noparams-dominator-tree-externalsorter.png|width=700! 

And paths to GC root of the already stopped ExternalSorter.
 !run1-noparams-dominator-tree-externalsorter-gc-root.png|width=700! 
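As a hedged sketch of a possible workaround (assuming the problem is tied to folding 
the 1000 sorted shuffle partitions into only 10 tasks; this is an illustration, not a 
fix for the leak itself), forcing a shuffle in the final step keeps the sorting work 
spread over the original 1000 tasks:

{code:scala}
import org.apache.hadoop.io._
import org.apache.spark._

val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
rdd
  .map(item => item._1.toString -> item._2.toString)
  .repartitionAndSortWithinPartitions(new HashPartitioner(1000))
  .coalesce(10, shuffle = true)   // or .repartition(10): each of the 10 tasks no longer reads ~100 sorted partitions directly
  .count
{code}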

[jira] [Created] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception

2018-11-19 Thread caoxuewen (JIRA)
caoxuewen created SPARK-26117:
-

 Summary: use SparkOutOfMemoryError instead of OutOfMemoryError 
when catch exception
 Key: SPARK-26117
 URL: https://issues.apache.org/jira/browse/SPARK-26117
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.5.0
Reporter: caoxuewen


PR #20014 introduced SparkOutOfMemoryError to avoid killing the entire 
executor when an OutOfMemoryError is thrown.
So when requesting memory via MemoryConsumer.allocatePage and handling the 
resulting allocation failure, SparkOutOfMemoryError should be used instead of 
OutOfMemoryError.
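A hedged sketch of the pattern being described (the helper and its arguments are 
hypothetical; only the exception class comes from Spark):

{code:scala}
import org.apache.spark.memory.SparkOutOfMemoryError

// Hypothetical helper: `allocate` stands in for a MemoryConsumer.allocatePage call
// and `spill` for whatever recovery the caller does when allocation fails.
def allocateOrSpill(allocate: () => Unit, spill: () => Unit): Unit =
  try {
    allocate()
  } catch {
    // Before: case _: OutOfMemoryError => spill()
    // After: only Spark's allocation failure is treated as recoverable;
    // a genuine JVM OutOfMemoryError still propagates and fails fast.
    case _: SparkOutOfMemoryError => spill()
  }
{code}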



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26117:


Assignee: Apache Spark

> use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception
> --
>
> Key: SPARK-26117
> URL: https://issues.apache.org/jira/browse/SPARK-26117
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.5.0
>Reporter: caoxuewen
>Assignee: Apache Spark
>Priority: Major
>
> PR #20014 introduced SparkOutOfMemoryError to avoid killing the entire 
> executor when an OutOfMemoryError is thrown.
> So when requesting memory via MemoryConsumer.allocatePage and handling the 
> resulting allocation failure, SparkOutOfMemoryError should be used instead of 
> OutOfMemoryError.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26117:


Assignee: (was: Apache Spark)

> use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception
> --
>
> Key: SPARK-26117
> URL: https://issues.apache.org/jira/browse/SPARK-26117
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.5.0
>Reporter: caoxuewen
>Priority: Major
>
> PR #20014 introduced SparkOutOfMemoryError to avoid killing the entire 
> executor when an OutOfMemoryError is thrown.
> So when requesting memory via MemoryConsumer.allocatePage and handling the 
> resulting allocation failure, SparkOutOfMemoryError should be used instead of 
> OutOfMemoryError.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26026) Published Scaladoc jars missing from Maven Central

2018-11-19 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26026.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23069
[https://github.com/apache/spark/pull/23069]

> Published Scaladoc jars missing from Maven Central
> --
>
> Key: SPARK-26026
> URL: https://issues.apache.org/jira/browse/SPARK-26026
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Long Cao
>Assignee: Sean Owen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> For 2.3.x and beyond, it appears that published *-javadoc.jars are missing.
> For concrete examples:
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
> After some searching, I'm venturing a guess that [this 
> commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033]
>  removed packaging Scaladoc with the rest of the distribution.
> I don't think it's a huge problem since the versioned Scaladocs are hosted on 
> apache.org, but I use an external documentation/search tool 
> ([Dash|https://kapeli.com/dash]) that operates by looking up published 
> javadoc jars and it'd be nice to have these available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26043) Make SparkHadoopUtil private to Spark

2018-11-19 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26043.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23066
[https://github.com/apache/spark/pull/23066]

> Make SparkHadoopUtil private to Spark
> -
>
> Key: SPARK-26043
> URL: https://issues.apache.org/jira/browse/SPARK-26043
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Sean Owen
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> This API contains a few small helper methods used internally by Spark, mostly 
> related to Hadoop configs and Kerberos.
> It has historically been marked as "DeveloperApi", but in reality it is not very 
> useful for others and changes too often to be considered a stable API. Better to 
> just make it private to Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26043) Make SparkHadoopUtil private to Spark

2018-11-19 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26043:
-

Assignee: Sean Owen

> Make SparkHadoopUtil private to Spark
> -
>
> Key: SPARK-26043
> URL: https://issues.apache.org/jira/browse/SPARK-26043
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Sean Owen
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> This API contains a few small helper methods used internally by Spark, mostly 
> related to Hadoop configs and Kerberos.
> It has historically been marked as "DeveloperApi", but in reality it is not very 
> useful for others and changes too often to be considered a stable API. Better to 
> just make it private to Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691725#comment-16691725
 ] 

Apache Spark commented on SPARK-26117:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23084

> use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception
> --
>
> Key: SPARK-26117
> URL: https://issues.apache.org/jira/browse/SPARK-26117
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.5.0
>Reporter: caoxuewen
>Priority: Major
>
> PR #20014 introduced SparkOutOfMemoryError to avoid killing the entire 
> executor when an OutOfMemoryError is thrown.
> So when requesting memory via MemoryConsumer.allocatePage and handling the 
> resulting allocation failure, SparkOutOfMemoryError should be used instead of 
> OutOfMemoryError.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26068) ChunkedByteBufferInputStream is truncated by empty chunk

2018-11-19 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26068.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23040
[https://github.com/apache/spark/pull/23040]

> ChunkedByteBufferInputStream is truncated by empty chunk
> 
>
> Key: SPARK-26068
> URL: https://issues.apache.org/jira/browse/SPARK-26068
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liu, Linhong
>Priority: Major
> Fix For: 3.0.0
>
>
> If a ChunkedByteBuffer contains an empty chunk in the middle, then the 
> ChunkedByteBufferInputStream is truncated: all data after the empty chunk 
> is never read.
> The problematic code:
> {code:java}
> // ChunkedByteBuffer.scala
> // Assume chunks.next returns an empty chunk, then we will reach
> // else branch no matter chunks.hasNext = true or not. So some data is lost.
> override def read(dest: Array[Byte], offset: Int, length: Int): Int = {
>   if (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext) {
>     currentChunk = chunks.next()
>   }
>   if (currentChunk != null && currentChunk.hasRemaining) {
>     val amountToGet = math.min(currentChunk.remaining(), length)
>     currentChunk.get(dest, offset, amountToGet)
>     amountToGet
>   } else {
>     close()
>     -1
>   }
> } {code}
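A hedged sketch of one possible fix (skipping every exhausted chunk, including empty 
ones, before concluding the stream is finished; this may differ from the change merged 
in the pull request above):

{code:scala}
// Same method as above, with the single `if` turned into a `while` so that an
// empty chunk in the middle no longer ends the stream early.
override def read(dest: Array[Byte], offset: Int, length: Int): Int = {
  while (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext) {
    currentChunk = chunks.next()
  }
  if (currentChunk != null && currentChunk.hasRemaining) {
    val amountToGet = math.min(currentChunk.remaining(), length)
    currentChunk.get(dest, offset, amountToGet)
    amountToGet
  } else {
    close()
    -1
  }
}
{code}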



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26068) ChunkedByteBufferInputStream is truncated by empty chunk

2018-11-19 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26068:
---

Assignee: Liu, Linhong

> ChunkedByteBufferInputStream is truncated by empty chunk
> 
>
> Key: SPARK-26068
> URL: https://issues.apache.org/jira/browse/SPARK-26068
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liu, Linhong
>Assignee: Liu, Linhong
>Priority: Major
> Fix For: 3.0.0
>
>
> If a ChunkedByteBuffer contains an empty chunk in the middle, then the 
> ChunkedByteBufferInputStream is truncated: all data after the empty chunk 
> is never read.
> The problematic code:
> {code:java}
> // ChunkedByteBuffer.scala
> // Assume chunks.next returns an empty chunk, then we will reach
> // else branch no matter chunks.hasNext = true or not. So some data is lost.
> override def read(dest: Array[Byte], offset: Int, length: Int): Int = {
>   if (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext) {
>     currentChunk = chunks.next()
>   }
>   if (currentChunk != null && currentChunk.hasRemaining) {
>     val amountToGet = math.min(currentChunk.remaining(), length)
>     currentChunk.get(dest, offset, amountToGet)
>     amountToGet
>   } else {
>     close()
>     -1
>   }
> } {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26112) Update since versions of new built-in functions.

2018-11-19 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26112:
---

Assignee: Takuya Ueshin

> Update since versions of new built-in functions.
> 
>
> Key: SPARK-26112
> URL: https://issues.apache.org/jira/browse/SPARK-26112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.0.0
>
>
> The following 5 functions were removed from branch-2.4:
> - map_entries
> - map_filter
> - transform_values
> - transform_keys
> - map_zip_with
> We should update the since version to 3.0.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26024) Dataset API: repartitionByRange(...) has inconsistent behaviour

2018-11-19 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26024.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23025
[https://github.com/apache/spark/pull/23025]

> Dataset API: repartitionByRange(...) has inconsistent behaviour
> ---
>
> Key: SPARK-26024
> URL: https://issues.apache.org/jira/browse/SPARK-26024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
> Environment: Spark version 2.3.2
>Reporter: Julien Peloton
>Priority: Major
>  Labels: dataFrame, partitioning, repartition, spark-sql
> Fix For: 3.0.0
>
>
> Hi,
> I recently played with the {{repartitionByRange}} method for DataFrames, 
> introduced in SPARK-22614. For DataFrames larger than the one tested in the 
> code (which has only 10 elements), the results vary from run to run.
> As a test showing the inconsistent behaviour, I start from the unit test code 
> used to test {{repartitionByRange}} 
> ([here|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala#L352])
>  but I increase the size of the initial array to 1000, repartition using 3 
> partitions, and count the number of elements per partition:
>  
> {code}
> // Shuffle numbers from 0 to 1000, and make a DataFrame
> val df = Random.shuffle(0.to(1000)).toDF("val")
> // Repartition it using 3 partitions
> // Sum up number of elements in each partition, and collect it.
> // And do it several times
> for (i <- 0 to 9) {
>   val counts = df.repartitionByRange(3, col("val"))
>     .mapPartitions { part => Iterator(part.size) }
>     .collect()
>   println(counts.toList)
> }
> // -> the number of elements in each partition varies...
> {code}
> I do not know whether it is expected (I will dig further in the code), but it 
> sounds like a bug.
>  Or I just misinterpret what {{repartitionByRange}} is for?
>  Any ideas?
> Thanks!
>  Julien
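One relevant detail, stated here as background rather than as the ticket's resolution: 
{{repartitionByRange}} estimates its range boundaries from a sample of the data, so the 
per-partition counts can legitimately vary between runs even though each partition still 
covers a disjoint value range. A small sketch that checks this, reusing the same {{df}} 
construction as above:

{code:scala}
import scala.util.Random
import org.apache.spark.sql.functions.col
import spark.implicits._

val df = Random.shuffle(0.to(1000)).toDF("val")
val ranges = df.repartitionByRange(3, col("val"))
  .mapPartitions { part =>
    val values = part.map(_.getInt(0)).toVector
    if (values.isEmpty) Iterator.empty
    else Iterator((values.min, values.max, values.size))
  }
  .collect()
// Each triple is (min, max, count): counts may differ from run to run,
// but the (min, max) ranges should not overlap.
ranges.sortBy(_._1).foreach(println)
{code}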



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26024) Dataset API: repartitionByRange(...) has inconsistent behaviour

2018-11-19 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26024:
---

Assignee: Julien Peloton

> Dataset API: repartitionByRange(...) has inconsistent behaviour
> ---
>
> Key: SPARK-26024
> URL: https://issues.apache.org/jira/browse/SPARK-26024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
> Environment: Spark version 2.3.2
>Reporter: Julien Peloton
>Assignee: Julien Peloton
>Priority: Major
>  Labels: dataFrame, partitioning, repartition, spark-sql
> Fix For: 3.0.0
>
>
> Hi,
> I recently played with the {{repartitionByRange}} method for DataFrames, 
> introduced in SPARK-22614. For DataFrames larger than the one tested in the 
> code (which has only 10 elements), the results vary from run to run.
> As a test showing the inconsistent behaviour, I start from the unit test code 
> used to test {{repartitionByRange}} 
> ([here|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala#L352])
>  but I increase the size of the initial array to 1000, repartition using 3 
> partitions, and count the number of elements per partition:
>  
> {code}
> // Shuffle numbers from 0 to 1000, and make a DataFrame
> val df = Random.shuffle(0.to(1000)).toDF("val")
> // Repartition it using 3 partitions
> // Sum up number of elements in each partition, and collect it.
> // And do it several times
> for (i <- 0 to 9) {
>   val counts = df.repartitionByRange(3, col("val"))
>     .mapPartitions { part => Iterator(part.size) }
>     .collect()
>   println(counts.toList)
> }
> // -> the number of elements in each partition varies...
> {code}
> I do not know whether it is expected (I will dig further in the code), but it 
> sounds like a bug.
>  Or I just misinterpret what {{repartitionByRange}} is for?
>  Any ideas?
> Thanks!
>  Julien



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26112) Update since versions of new built-in functions.

2018-11-19 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26112.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23082
[https://github.com/apache/spark/pull/23082]

> Update since versions of new built-in functions.
> 
>
> Key: SPARK-26112
> URL: https://issues.apache.org/jira/browse/SPARK-26112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.0.0
>
>
> The following 5 functions were removed from branch-2.4:
> - map_entries
> - map_filter
> - transform_values
> - transform_keys
> - map_zip_with
> We should update the since version to 3.0.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version

2018-11-19 Thread Andrew Otto (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691787#comment-16691787
 ] 

Andrew Otto commented on SPARK-14492:
-

I found my issue: we were loading some Hive 1.1.0 jars manually using 
spark.driver.extraClassPath in order to open a Hive JDBC connection directly to 
Hive (instead of using spark.sql()) to work around 
https://issues.apache.org/jira/browse/SPARK-23890. The Hive 1.1.0 classes were 
loaded before the ones included with Spark, and as such they failed when 
referencing a Hive configuration field that did not exist in 1.1.0.
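As a hedged sketch (Spark 2.x style, with placeholder values), the built-in way to talk 
to an older metastore without putting old Hive jars on the driver class path is Spark's 
isolated metastore classloader configuration:

{code:scala}
import org.apache.spark.sql.SparkSession

// Placeholder version string and jar location; the metastore version must be one
// that the Spark release in use actually supports.
val spark = SparkSession.builder()
  .appName("old-metastore-example")
  .config("spark.sql.hive.metastore.version", "1.1.0")
  .config("spark.sql.hive.metastore.jars", "maven")   // or a classpath pointing at matching Hive jars
  .enableHiveSupport()
  .getOrCreate()
{code}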

 

> Spark SQL 1.6.0 does not work with external Hive metastore version lower than 
> 1.2.0; its not backwards compatible with earlier version
> --
>
> Key: SPARK-14492
> URL: https://issues.apache.org/jira/browse/SPARK-14492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunil Rangwani
>Priority: Critical
>
> Spark SQL, when configured with a Hive version lower than 1.2.0, throws a 
> java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME 
> because this field was introduced in Hive 1.2.0, so it is not possible to use a 
> Hive metastore version lower than 1.2.0 with Spark. The details of the Hive 
> changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 
> {code:java}
> Exception in thread "main" java.lang.NoSuchFieldError: 
> METASTORE_CLIENT_SOCKET_LIFETIME
>   at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26071) disallow map as map key

2018-11-19 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26071.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23045
[https://github.com/apache/spark/pull/23045]

> disallow map as map key
> ---
>
> Key: SPARK-26071
> URL: https://issues.apache.org/jira/browse/SPARK-26071
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25528) data source V2 API refactoring (batch read)

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691799#comment-16691799
 ] 

Apache Spark commented on SPARK-25528:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/23086

> data source V2 API refactoring (batch read)
> ---
>
> Key: SPARK-25528
> URL: https://issues.apache.org/jira/browse/SPARK-25528
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>
> Refactor the read-side API according to the following abstraction:
> {code}
> batch: catalog -> table -> scan
> streaming: catalog -> table -> stream -> scan
> {code}
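As a purely illustrative sketch of that layering (the trait names below are invented 
for illustration and are not the actual data source V2 interfaces):

{code:scala}
// Invented names, illustration only: a catalog resolves a table, the table says how
// it can be scanned, and the scan is what finally produces partitions of rows to read.
trait IllustrativeCatalog { def loadTable(name: String): IllustrativeTable }
trait IllustrativeTable { def newScan(): IllustrativeScan }
trait IllustrativeScan { def planInputPartitions(): Seq[String] }   // stand-in for real partition objects
{code}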



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25528) data source V2 API refactoring (batch read)

2018-11-19 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-25528:

Summary: data source V2 API refactoring (batch read)  (was: data source V2 
read side API refactoring)

> data source V2 API refactoring (batch read)
> ---
>
> Key: SPARK-25528
> URL: https://issues.apache.org/jira/browse/SPARK-25528
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>
> Refactor the read-side API according to the following abstraction:
> {code}
> batch: catalog -> table -> scan
> streaming: catalog -> table -> stream -> scan
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version

2018-11-19 Thread Sunil Rangwani (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691811#comment-16691811
 ] 

Sunil Rangwani commented on SPARK-14492:


[~srowen]

Please reopen this.

It is not about building with varying versions of Hive. After building with the 
required version of Hive, it gives a runtime error - as explained in my earlier 
comment. 

> Spark SQL 1.6.0 does not work with external Hive metastore version lower than 
> 1.2.0; its not backwards compatible with earlier version
> --
>
> Key: SPARK-14492
> URL: https://issues.apache.org/jira/browse/SPARK-14492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunil Rangwani
>Priority: Critical
>
> Spark SQL, when configured with a Hive version lower than 1.2.0, throws a 
> java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME. 
> This field was introduced in Hive 1.2.0, so it is not possible to use a Hive 
> metastore version lower than 1.2.0 with Spark. The details of the Hive 
> changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 
> {code:java}
> Exception in thread "main" java.lang.NoSuchFieldError: 
> METASTORE_CLIENT_SOCKET_LIFETIME
>   at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
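For background, a sketch of the settings Spark documents for talking to an older 
external metastore through its isolated Hive client; the version and jar path are 
placeholders, and this is context for the discussion above, not a claimed fix for 
the NoSuchFieldError.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: metastore version and jar path below are illustrative placeholders.
val conf = new SparkConf()
  .setAppName("older-metastore-sketch")
  .set("spark.sql.hive.metastore.version", "0.13.1")
  .set("spark.sql.hive.metastore.jars", "/path/to/hive-0.13.1/lib/*")

val sc = new SparkContext(conf)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("SHOW TABLES").show()
{code}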



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version

2018-11-19 Thread Sunil Rangwani (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691828#comment-16691828
 ] 

Sunil Rangwani commented on SPARK-14492:


[~ottomata] you are using a CDH version I believe. I don't know what 
dependencies their parcels include.

This bug was originally found in the open source version built using maven as 
described here 
[https://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support]
 

 

> Spark SQL 1.6.0 does not work with external Hive metastore version lower than 
> 1.2.0; its not backwards compatible with earlier version
> --
>
> Key: SPARK-14492
> URL: https://issues.apache.org/jira/browse/SPARK-14492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunil Rangwani
>Priority: Critical
>
> Spark SQL, when configured with a Hive version lower than 1.2.0, throws a 
> java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME. 
> This field was introduced in Hive 1.2.0, so it is not possible to use a Hive 
> metastore version lower than 1.2.0 with Spark. The details of the Hive 
> changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 
> {code:java}
> Exception in thread "main" java.lang.NoSuchFieldError: 
> METASTORE_CLIENT_SOCKET_LIFETIME
>   at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26090) Resolve most miscellaneous deprecation and build warnings for Spark 3

2018-11-19 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26090.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23065
[https://github.com/apache/spark/pull/23065]

> Resolve most miscellaneous deprecation and build warnings for Spark 3
> -
>
> Key: SPARK-26090
> URL: https://issues.apache.org/jira/browse/SPARK-26090
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> The build has a lot of deprecation warnings. Some are new in Scala 2.12 and 
> Java 11. We've fixed some, but I wanted to take a pass at fixing lots of easy 
> miscellaneous ones here.
> They're too numerous and small to list here; see the pull request. Some 
> highlights:
> - @BeanInfo is deprecated in 2.12, and BeanInfo classes are pretty ancient in 
> Java. Instead, case classes can explicitly declare getters
> - Eta expansion of zero-arg methods; foo() becomes () => foo() in many cases
> - Floating-point Range is inexact and deprecated, like 0.0 to 100.0 by 1.0
> - finalize() is finally deprecated (just needs to be suppressed)
> - StageInfo.attemptId was deprecated and easiest to remove here
> I'm not now going to touch some chunks of deprecation warnings:
> - Parquet deprecations
> - Hive deprecations (particularly serde2 classes)
> - Deprecations in generated code (mostly Thriftserver CLI)
> - ProcessingTime deprecations (we may need to revive this class as internal)
> - many MLlib deprecations because they concern methods that may be removed 
> anyway
> - a few Kinesis deprecations I couldn't figure out
> - Mesos get/setRole, which I don't know well
> - Kafka/ZK deprecations (e.g. poll())
> - Kinesis
> - a few other ones that will probably resolve by deleting a deprecated method
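A small sketch of two of the fixes above, with made-up names; these are generic 
instances of the Scala 2.12 warnings, not the actual Spark call sites.

{code:scala}
object DeprecationFixSketch {
  def refresh(): Unit = ()

  // Eta-expansion of zero-argument methods is deprecated in Scala 2.12:
  // before (warns): val f: () => Unit = refresh
  val f: () => Unit = () => refresh()

  // Floating-point Range is inexact and deprecated:
  // before (deprecated): val xs = 0.0 to 100.0 by 1.0
  val xs: Seq[Double] = (0 to 100).map(_.toDouble)
}
{code}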



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23886) update query.status

2018-11-19 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691881#comment-16691881
 ] 

Gabor Somogyi commented on SPARK-23886:
---

Thanks [~efim.poberezkin]! May I ask you then to close your PRs for this Jira 
and for SPARK-24063?

> update query.status
> ---
>
> Key: SPARK-23886
> URL: https://issues.apache.org/jira/browse/SPARK-23886
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26026) Published Scaladoc jars missing from Maven Central

2018-11-19 Thread Long Cao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691965#comment-16691965
 ] 

Long Cao commented on SPARK-26026:
--

[~srowen] thanks for the expedient resolution! It's a small thing to me but 
nice to see.

> Published Scaladoc jars missing from Maven Central
> --
>
> Key: SPARK-26026
> URL: https://issues.apache.org/jira/browse/SPARK-26026
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Long Cao
>Assignee: Sean Owen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> For 2.3.x and beyond, it appears that published *-javadoc.jars are missing.
> For concrete examples:
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
> After some searching, I'm venturing a guess that [this 
> commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033]
>  removed packaging Scaladoc with the rest of the distribution.
> I don't think it's a huge problem since the versioned Scaladocs are hosted on 
> apache.org, but I use an external documentation/search tool 
> ([Dash|https://kapeli.com/dash]) that operates by looking up published 
> javadoc jars and it'd be nice to have these available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-19 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-26118:
--

 Summary: Make Jetty's requestHeaderSize configurable in Spark
 Key: SPARK-26118
 URL: https://issues.apache.org/jira/browse/SPARK-26118
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.4.0, 2.3.2, 2.2.2, 2.1.3, 3.0.0
Reporter: Attila Zsolt Piros


For long authorization fields, the request header size can exceed the default 
limit (8192 bytes), in which case Jetty replies with HTTP 413 (Request Entity Too 
Large).

This issue may occur if the user is a member of many Active Directory user 
groups.

The HTTP request to the server carries the Kerberos token in the WWW-Authenticate 
header, and the header size grows with the number of user groups.

Currently there is no way in Spark to override this limit.
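As background, a sketch of how the limit is raised on a plain Jetty connector. The 
config key mentioned in the comment (spark.ui.requestHeaderSize) is only a 
hypothetical name for illustration; no such Spark setting exists at the time of 
this report.

{code:scala}
import org.eclipse.jetty.server.{HttpConfiguration, HttpConnectionFactory, Server, ServerConnector}

// e.g. read from a hypothetical spark.ui.requestHeaderSize setting
val requestHeaderSize = 16 * 1024

val httpConfig = new HttpConfiguration()
httpConfig.setRequestHeaderSize(requestHeaderSize)  // Jetty's default is 8192 bytes

val server = new Server()
val connector = new ServerConnector(server, new HttpConnectionFactory(httpConfig))
connector.setPort(4040)
server.addConnector(connector)
server.start()
{code}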



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-19 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-26118:
---
Affects Version/s: (was: 2.3.2)
   (was: 2.4.0)
   (was: 2.2.2)
   (was: 2.1.3)

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> For long authorization fields, the request header size can exceed the default 
> limit (8192 bytes), in which case Jetty replies with HTTP 413 (Request Entity 
> Too Large).
> This issue may occur if the user is a member of many Active Directory user 
> groups.
> The HTTP request to the server carries the Kerberos token in the 
> WWW-Authenticate header, and the header size grows with the number of user 
> groups.
> Currently there is no way in Spark to override this limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-19 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692018#comment-16692018
 ] 

Attila Zsolt Piros commented on SPARK-26118:


I am working on a PR.

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> For long authorization fields, the request header size can exceed the default 
> limit (8192 bytes), in which case Jetty replies with HTTP 413 (Request Entity 
> Too Large).
> This issue may occur if the user is a member of many Active Directory user 
> groups.
> The HTTP request to the server carries the Kerberos token in the 
> WWW-Authenticate header, and the header size grows with the number of user 
> groups.
> Currently there is no way in Spark to override this limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26119) Task metrics summary in the stage page should contain only successful tasks metrics

2018-11-19 Thread ABHISHEK KUMAR GUPTA (JIRA)
ABHISHEK KUMAR GUPTA created SPARK-26119:


 Summary: Task metrics summary in the stage page should contain 
only successful tasks metrics
 Key: SPARK-26119
 URL: https://issues.apache.org/jira/browse/SPARK-26119
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 2.3.2
Reporter: ABHISHEK KUMAR GUPTA


Currently, the task metrics summary table on the stage page shows a summary of all 
tasks. Instead, it should display the summary of only the succeeded tasks.
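A rough sketch of the intended behavior, with placeholder types; these are not the 
actual Spark UI classes.

{code:scala}
// Placeholder task representation for illustration only.
case class TaskRow(status: String, durationMs: Long)

// Summarize only the tasks that finished successfully.
def succeededDurations(tasks: Seq[TaskRow]): Seq[Long] =
  tasks.filter(_.status == "SUCCESS").map(_.durationMs)

// The quantiles / min / max shown in the stage-page summary table would then be
// computed over succeededDurations(tasks) instead of over all tasks.
{code}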



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26119) Task metrics summary in the stage page should contain only successful tasks metrics

2018-11-19 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692093#comment-16692093
 ] 

shahid commented on SPARK-26119:


Thanks. I am working on it.

> Task metrics summary in the stage page should contain only successful tasks 
> metrics
> ---
>
> Key: SPARK-26119
> URL: https://issues.apache.org/jira/browse/SPARK-26119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> Currently, the task metrics summary table on the stage page shows a summary of 
> all tasks. Instead, it should display the summary of only the succeeded tasks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests

2018-11-19 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26120:
-
Component/s: Structured Streaming
 SparkR

> Fix a streaming query leak in Structured Streaming R tests
> --
>
> Key: SPARK-26120
> URL: https://issues.apache.org/jira/browse/SPARK-26120
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, Structured Streaming, Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> "Specify a schema by using a DDL-formatted string when reading" doesn't stop 
> the streaming query before stopping Spark. It causes the following annoying 
> logs.
> {code}
> Exception in thread "stream execution thread for [id = 
> 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = 
> 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution 
> thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = 
> 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: 
> Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> {code}
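The fix here is presumably to stop the query before stopping the session; in SparkR 
that would be stopQuery() followed by sparkR.session.stop(). Below is a Scala sketch 
of the same pattern, for illustration only.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("leak-sketch").getOrCreate()

val query = spark.readStream
  .format("rate")
  .load()
  .writeStream
  .format("console")
  .start()

try {
  query.processAllAvailable()  // the test's assertions would run here
} finally {
  query.stop()   // stop the streaming query first ...
  spark.stop()   // ... then stop the Spark session
}
{code}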



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26119) Task metrics summary in the stage page should contain only successful tasks metrics

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692175#comment-16692175
 ] 

Apache Spark commented on SPARK-26119:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/23088

> Task metrics summary in the stage page should contain only successful tasks 
> metrics
> ---
>
> Key: SPARK-26119
> URL: https://issues.apache.org/jira/browse/SPARK-26119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> Currently, the task metrics summary table on the stage page shows a summary of 
> all tasks. Instead, it should display the summary of only the succeeded tasks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests

2018-11-19 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26120:
-
Priority: Minor  (was: Major)

> Fix a streaming query leak in Structured Streaming R tests
> --
>
> Key: SPARK-26120
> URL: https://issues.apache.org/jira/browse/SPARK-26120
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, Structured Streaming, Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> "Specify a schema by using a DDL-formatted string when reading" doesn't stop 
> the streaming query before stopping Spark. It causes the following annoying 
> logs.
> {code}
> Exception in thread "stream execution thread for [id = 
> 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = 
> 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution 
> thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = 
> 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: 
> Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26120:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Fix a streaming query leak in Structured Streaming R tests
> --
>
> Key: SPARK-26120
> URL: https://issues.apache.org/jira/browse/SPARK-26120
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, Structured Streaming, Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Minor
>
> "Specify a schema by using a DDL-formatted string when reading" doesn't stop 
> the streaming query before stopping Spark. It causes the following annoying 
> logs.
> {code}
> Exception in thread "stream execution thread for [id = 
> 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = 
> 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution 
> thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = 
> 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: 
> Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26119) Task metrics summary in the stage page should contain only successful tasks metrics

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26119:


Assignee: (was: Apache Spark)

> Task metrics summary in the stage page should contain only successful tasks 
> metrics
> ---
>
> Key: SPARK-26119
> URL: https://issues.apache.org/jira/browse/SPARK-26119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> Currently, the task metrics summary table on the stage page shows a summary of 
> all tasks. Instead, it should display the summary of only the succeeded tasks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692176#comment-16692176
 ] 

Apache Spark commented on SPARK-26120:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/23089

> Fix a streaming query leak in Structured Streaming R tests
> --
>
> Key: SPARK-26120
> URL: https://issues.apache.org/jira/browse/SPARK-26120
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, Structured Streaming, Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> "Specify a schema by using a DDL-formatted string when reading" doesn't stop 
> the streaming query before stopping Spark. It causes the following annoying 
> logs.
> {code}
> Exception in thread "stream execution thread for [id = 
> 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = 
> 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution 
> thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = 
> 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: 
> Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-26119) Task metrics summary in the stage page should contain only successful tasks metrics

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26119:


Assignee: Apache Spark

> Task metrics summary in the stage page should contain only successful tasks 
> metrics
> ---
>
> Key: SPARK-26119
> URL: https://issues.apache.org/jira/browse/SPARK-26119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: Apache Spark
>Priority: Major
>
> Currently, the task metrics summary table on the stage page shows a summary of 
> all tasks. Instead, it should display the summary of only the succeeded tasks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692178#comment-16692178
 ] 

Apache Spark commented on SPARK-26120:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/23089

> Fix a streaming query leak in Structured Streaming R tests
> --
>
> Key: SPARK-26120
> URL: https://issues.apache.org/jira/browse/SPARK-26120
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, Structured Streaming, Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> "Specify a schema by using a DDL-formatted string when reading" doesn't stop 
> the streaming query before stopping Spark. It causes the following annoying 
> logs.
> {code}
> Exception in thread "stream execution thread for [id = 
> 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = 
> 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution 
> thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = 
> 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: 
> Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests

2018-11-19 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-26120:


 Summary: Fix a streaming query leak in Structured Streaming R tests
 Key: SPARK-26120
 URL: https://issues.apache.org/jira/browse/SPARK-26120
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


"Specify a schema by using a DDL-formatted string when reading" doesn't stop 
the streaming query before stopping Spark. It causes the following annoying 
logs.

{code}
Exception in thread "stream execution thread for [id = 
186dad10-e87f-4155-8119-00e0e63bbc1a, runId = 
2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution 
thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = 
644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: 
Exception thrown in awaitResult: 
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
at 
org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
at 
org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
at 
org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
at 
org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
... 7 more
org.apache.spark.SparkException: Exception thrown in awaitResult: 
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
at 
org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
at 
org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
at 
org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
at 
org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
... 7 more
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26120:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Fix a streaming query leak in Structured Streaming R tests
> --
>
> Key: SPARK-26120
> URL: https://issues.apache.org/jira/browse/SPARK-26120
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, Structured Streaming, Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> "Specify a schema by using a DDL-formatted string when reading" doesn't stop 
> the streaming query before stopping Spark. It causes the following annoying 
> logs.
> {code}
> Exception in thread "stream execution thread for [id = 
> 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = 
> 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution 
> thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = 
> 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: 
> Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26121) [Structured Streaming] Allow users to define prefix of Kafka's consumer group (group.id)

2018-11-19 Thread Anastasios Zouzias (JIRA)
Anastasios Zouzias created SPARK-26121:
--

 Summary: [Structured Streaming] Allow users to define prefix of 
Kafka's consumer group (group.id)
 Key: SPARK-26121
 URL: https://issues.apache.org/jira/browse/SPARK-26121
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Anastasios Zouzias


I ran into the following situation with Spark Structured Streaming (SS) using 
Kafka.
 
In a project that I work on, there is already a secured Kafka setup where ops can 
issue an SSL certificate per [group.id|http://group.id/], which must be predefined 
(or at least have a predefined prefix).
 
On the other hand, Spark SS fixes the [group.id|http://group.id/] to
 
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"
 
see, e.g.,
 
[https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L124]
 
https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L81
 
I guess the Spark developers had a good reason to fix it, but is it possible to 
make the prefix of the above uniqueGroupId ("spark-kafka-source") configurable?
 
The rationale is that Spark users would then not be forced to use the same 
certificate for all group ids of the form spark-kafka-source-*.
 
DoD:
* Allow Spark SS users to define the group.id prefix as an input parameter (see 
the sketch below).
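A minimal sketch of what that could look like, assuming a hypothetical source 
option named "groupIdPrefix"; the option name and helper are illustrative only, not 
an existing Spark API.

{code:scala}
import java.util.UUID

// Falls back to the current hard-coded prefix when the option is absent.
def uniqueGroupId(parameters: Map[String, String], metadataPath: String): String = {
  val prefix = parameters.getOrElse("groupIdPrefix", "spark-kafka-source")
  s"$prefix-${UUID.randomUUID}-${metadataPath.hashCode}"
}

// Example: uniqueGroupId(Map("groupIdPrefix" -> "team-a"), "/checkpoints/query1")
// yields something like "team-a-<uuid>-<hash>".
{code}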



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov reopened SPARK-26019:
---

Reproduced it myself.

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems error is flaky - on next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692219#comment-16692219
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

[~hyukjin.kwon] today I reproduced this for the first time... and we still 
receive reports from our other users as well.

Here's the code, on Spark 2.3.2 + Python 2.7.15.

Execute it on a freshly created Spark session:

{code:python}
def python_major_version():
    import sys
    return sys.version_info[0]

print(python_major_version())

print(sc.parallelize([1]).map(lambda x: python_major_version()).collect())  # error happens here!
{code}

It always reproduces for me.

Notice that just rerunning the same code makes this error disappear.



> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems error is flaky - on next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692226#comment-16692226
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

Might have been broken by the https://github.com/apache/spark/pull/22635 change.

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems error is flaky - on next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692233#comment-16692233
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

Sorry, nope, it was broken by this change: 
https://github.com/apache/spark/commit/15fc2372269159ea2556b028d4eb8860c4108650#diff-c3339bbf2b850b79445b41e9eecf57c4R249
 



> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems error is flaky - on next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692237#comment-16692237
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

cc [~lucacanali] 

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems error is flaky - on next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26118:


Assignee: (was: Apache Spark)

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> For long authorization fields the request header size could be over the 
> default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request 
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user 
> groups.
> The HTTP request to the server contains the Kerberos token in the 
> WWW-Authenticate header. The header size increases together with the number 
> of user groups. 
> Currently there is no way in Spark to override this limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692246#comment-16692246
 ] 

Apache Spark commented on SPARK-26118:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/23090
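
For context, a minimal sketch of how such a setting might be used from PySpark 
once it is configurable; the key name spark.ui.requestHeaderSize follows the 
proposed change but is an assumption here, not a setting available in released 
versions at the time of writing.

{code:python}
from pyspark.sql import SparkSession

# Assumed configuration key (per the proposed change): raise Jetty's request
# header limit so that large Kerberos/AD tokens no longer trigger HTTP 413.
spark = (SparkSession.builder
         .appName("large-auth-headers")
         .config("spark.ui.requestHeaderSize", "64k")  # assumed key and size format
         .getOrCreate())
{code}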

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> For long authorization fields the request header size could be over the 
> default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request 
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user 
> groups.
> The HTTP request to the server contains the Kerberos token in the 
> WWW-Authenticate header. The header size increases together with the number 
> of user groups. 
> Currently there is no way in Spark to override this limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692248#comment-16692248
 ] 

Apache Spark commented on SPARK-26118:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/23090

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> For long authorization fields the request header size could be over the 
> default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request 
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user 
> groups.
> The HTTP request to the server contains the Kerberos token in the 
> WWW-Authenticate header. The header size increases together with the number 
> of user groups. 
> Currently there is no way in Spark to override this limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26118:


Assignee: Apache Spark

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Apache Spark
>Priority: Major
>
> For long authorization fields the request header size could be over the 
> default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request 
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user 
> groups.
> The HTTP request to the server contains the Kerberos token in the 
> WWW-Authenticate header. The header size increases together with the number 
> of user groups. 
> Currently there is no way in Spark to override this limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26122) Support encoding for multiLine in CSV datasource

2018-11-19 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26122:
--

 Summary: Support encoding for multiLine in CSV datasource
 Key: SPARK-26122
 URL: https://issues.apache.org/jira/browse/SPARK-26122
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


Currently, the CSV datasource is not able to read CSV files in a non-default 
encoding when multiLine is enabled. The ticket aims to support the CSV encoding 
option in that mode.
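
For reference, a small sketch of the combination in question; the file path and 
charset below are illustrative. With multiLine enabled, the requested encoding 
is currently not applied, which is what this ticket aims to fix.

{code:python}
# Reading a non-UTF-8 CSV with multiline records; today this combination does
# not honor the requested encoding, and the ticket proposes to support it.
df = (spark.read
      .option("header", True)
      .option("multiLine", True)
      .option("encoding", "UTF-16LE")   # charset chosen for illustration
      .csv("/data/example-utf16.csv"))  # hypothetical path
{code}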



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26122) Support encoding for multiLine in CSV datasource

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692305#comment-16692305
 ] 

Apache Spark commented on SPARK-26122:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/23091

> Support encoding for multiLine in CSV datasource
> 
>
> Key: SPARK-26122
> URL: https://issues.apache.org/jira/browse/SPARK-26122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, CSV datasource is not able to read CSV files in different encoding 
> when multiLine is enabled. The ticket aims to support the encoding CSV 
> options in the mode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26122) Support encoding for multiLine in CSV datasource

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26122:


Assignee: Apache Spark

> Support encoding for multiLine in CSV datasource
> 
>
> Key: SPARK-26122
> URL: https://issues.apache.org/jira/browse/SPARK-26122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Currently, CSV datasource is not able to read CSV files in different encoding 
> when multiLine is enabled. The ticket aims to support the encoding CSV 
> options in the mode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26122) Support encoding for multiLine in CSV datasource

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692306#comment-16692306
 ] 

Apache Spark commented on SPARK-26122:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/23091

> Support encoding for multiLine in CSV datasource
> 
>
> Key: SPARK-26122
> URL: https://issues.apache.org/jira/browse/SPARK-26122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, CSV datasource is not able to read CSV files in different encoding 
> when multiLine is enabled. The ticket aims to support the encoding CSV 
> options in the mode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26122) Support encoding for multiLine in CSV datasource

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26122:


Assignee: (was: Apache Spark)

> Support encoding for multiLine in CSV datasource
> 
>
> Key: SPARK-26122
> URL: https://issues.apache.org/jira/browse/SPARK-26122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, CSV datasource is not able to read CSV files in different encoding 
> when multiLine is enabled. The ticket aims to support the encoding CSV 
> options in the mode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692355#comment-16692355
 ] 

Imran Rashid commented on SPARK-26019:
--

[~hyukjin.kwon] you think maybe these two lines need to be swapped?

https://github.com/apache/spark/blob/master/python/pyspark/accumulators.py#L274-L275

any chance you can try with that change, [~Tagar], since you've got a handy 
environment to reproduce this in?
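
For readers following along, a simplified sketch of the reordering under 
discussion; the class below mirrors AccumulatorServer from 
pyspark/accumulators.py, trimmed to the two lines in question.

{code:python}
import socketserver  # SocketServer on Python 2

class AccumulatorServer(socketserver.TCPServer):
    def __init__(self, server_address, RequestHandlerClass, auth_token):
        # Proposed order: store the token first, so no handler can ever
        # observe self.server.auth_token as None ...
        self.auth_token = auth_token
        # ... and only then run the TCPServer constructor (the current code
        # performs these two steps in the opposite order).
        socketserver.TCPServer.__init__(self, server_address, RequestHandlerClass)
{code}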

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems error is flaky - on next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692377#comment-16692377
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

[~irashid] thanks a lot for looking at this!
It makes sense to swap those two lines so that the parent class constructor is 
called after auth_token has been initialized.

We're using Cloudera's Spark, and the pyspark dependencies are inside a zip 
file, in an "immutable" parcel...
Unfortunately there is no quick way to test it, as it has to be propagated to 
all worker nodes. [~bersprockets] any ideas on how to test this?

Thank you.


> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems error is flaky - on next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26094) Streaming WAL should create parent dirs

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692380#comment-16692380
 ] 

Apache Spark commented on SPARK-26094:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/23092

> Streaming WAL should create parent dirs
> ---
>
> Key: SPARK-26094
> URL: https://issues.apache.org/jira/browse/SPARK-26094
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Blocker
>
> SPARK-25871 introduced a regression in the streaming WAL -- it no longer 
> makes all the parent dirs, so you may see an exception like this in cases 
> that used to work:
> {noformat}
> 18/11/09 03:31:48 ERROR util.FileBasedWriteAheadLog_ReceiverSupervisorImpl: 
> Failed to write to write ahead log after 3 failures
> ...
> org.apache.spark.SparkException: Exception thrown in awaitResult:
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
> at 
> org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:210)
> ...
> Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: 
> /tmp/__spark__1e8ba184-d323-47eb-b857-0e6285409424/88992/checkpoints/receivedData/0
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyParentDir(FSDirectory.java:1923)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26094) Streaming WAL should create parent dirs

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26094:


Assignee: Imran Rashid  (was: Apache Spark)

> Streaming WAL should create parent dirs
> ---
>
> Key: SPARK-26094
> URL: https://issues.apache.org/jira/browse/SPARK-26094
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Blocker
>
> SPARK-25871 introduced a regression in the streaming WAL -- it no longer 
> makes all the parent dirs, so you may see an exception like this in cases 
> that used to work:
> {noformat}
> 18/11/09 03:31:48 ERROR util.FileBasedWriteAheadLog_ReceiverSupervisorImpl: 
> Failed to write to write ahead log after 3 failures
> ...
> org.apache.spark.SparkException: Exception thrown in awaitResult:
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
> at 
> org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:210)
> ...
> Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: 
> /tmp/__spark__1e8ba184-d323-47eb-b857-0e6285409424/88992/checkpoints/receivedData/0
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyParentDir(FSDirectory.java:1923)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26094) Streaming WAL should create parent dirs

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26094:


Assignee: Apache Spark  (was: Imran Rashid)

> Streaming WAL should create parent dirs
> ---
>
> Key: SPARK-26094
> URL: https://issues.apache.org/jira/browse/SPARK-26094
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: Apache Spark
>Priority: Blocker
>
> SPARK-25871 introduced a regression in the streaming WAL -- it no longer 
> makes all the parent dirs, so you may see an exception like this in cases 
> that used to work:
> {noformat}
> 18/11/09 03:31:48 ERROR util.FileBasedWriteAheadLog_ReceiverSupervisorImpl: 
> Failed to write to write ahead log after 3 failures
> ...
> org.apache.spark.SparkException: Exception thrown in awaitResult:
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
> at 
> org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:210)
> ...
> Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: 
> /tmp/__spark__1e8ba184-d323-47eb-b857-0e6285409424/88992/checkpoints/receivedData/0
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyParentDir(FSDirectory.java:1923)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26094) Streaming WAL should create parent dirs

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692382#comment-16692382
 ] 

Apache Spark commented on SPARK-26094:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/23092

> Streaming WAL should create parent dirs
> ---
>
> Key: SPARK-26094
> URL: https://issues.apache.org/jira/browse/SPARK-26094
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Blocker
>
> SPARK-25871 introduced a regression in the streaming WAL -- it no longer 
> makes all the parent dirs, so you may see an exception like this in cases 
> that used to work:
> {noformat}
> 18/11/09 03:31:48 ERROR util.FileBasedWriteAheadLog_ReceiverSupervisorImpl: 
> Failed to write to write ahead log after 3 failures
> ...
> org.apache.spark.SparkException: Exception thrown in awaitResult:
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
> at 
> org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:210)
> ...
> Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: 
> /tmp/__spark__1e8ba184-d323-47eb-b857-0e6285409424/88992/checkpoints/receivedData/0
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyParentDir(FSDirectory.java:1923)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26123) Make all MLlib/ML Models public

2018-11-19 Thread Nicholas Resnick (JIRA)
Nicholas Resnick created SPARK-26123:


 Summary: Make all MLlib/ML Models public
 Key: SPARK-26123
 URL: https://issues.apache.org/jira/browse/SPARK-26123
 Project: Spark
  Issue Type: Wish
  Components: ML, MLlib
Affects Versions: 2.4.0
Reporter: Nicholas Resnick


Can all Model subclasses be made public? It's very difficult to make custom 
Models that build off of MLlib/ML Models otherwise. I also can't think of a 
reason why Estimators should be public, but not Models.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692495#comment-16692495
 ] 

Hyukjin Kwon commented on SPARK-26019:
--

I don't understand how swapping the two lines solves this issue. Also, I don't 
get how it happens or how to reproduce it.
[~Tagar], it's you who reported the issue, and it looks odd to ask an arbitrary 
person for an idea on how to test it in Apache JIRA.

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems error is flaky - on next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26019.
--
Resolution: Cannot Reproduce

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems error is flaky - on next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25919) Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned

2018-11-19 Thread Pawan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pawan updated SPARK-25919:
--
Affects Version/s: 2.1.0

> Date value corrupts when tables are "ParquetHiveSerDe" formatted and target 
> table is Partitioned
> 
>
> Key: SPARK-25919
> URL: https://issues.apache.org/jira/browse/SPARK-25919
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 2.1.0, 2.2.1
>Reporter: Pawan
>Priority: Major
>
> Hi
> I found a really strange issue. Below are the steps to reproduce it. This 
> issue occurs only when the table row format is ParquetHiveSerDe and the 
> target table is Partitioned
> *Hive:*
> Login in to hive terminal on cluster and create below tables.
> {code:java}
> create table t_src(
> name varchar(10),
> dob timestamp
> )
> ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> create table t_tgt(
> name varchar(10),
> dob timestamp
> )
> PARTITIONED BY (city varchar(10))
> ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
> {code}
> Insert data into the source table (t_src)
> {code:java}
> INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 
> 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 
> 00:00:00.0');{code}
> *Spark-shell:*
> Get on to spark-shell. 
> Execute below commands on spark shell:
> {code:java}
> import org.apache.spark.sql.hive.HiveContext
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> val q0 = "TRUNCATE table t_tgt"
> val q1 = "SELECT CAST(alias.name AS STRING) as a0, alias.dob as a1 FROM 
> DEFAULT.t_src alias"
> val q2 = "INSERT INTO TABLE DEFAULT.t_tgt PARTITION (city) SELECT tbl0.a0 as 
> c0, tbl0.a1 as c1, NULL as c2 FROM tbl0"
> sqlContext.sql(q0)
> sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0")
> sqlContext.sql(q2)
> {code}
>  After this check the contents of target table t_tgt. You will see the date 
> "0001-01-01 00:00:00" changed to "0002-01-01 00:00:00". Below snippets shows 
> the contents of both the tables:
> {code:java}
> select * from t_src;
> +------------+------------------------+
> | t_src.name | t_src.dob              |
> +------------+------------------------+
> | p1         | 0001-01-01 00:00:00.0  |
> | p2         | 0002-01-01 00:00:00.0  |
> | p3         | 0003-01-01 00:00:00.0  |
> | p4         | 0004-01-01 00:00:00.0  |
> +------------+------------------------+
>
> select * from t_tgt;
> +------------+------------------------+------------+
> | t_src.name | t_src.dob              | t_tgt.city |
> +------------+------------------------+------------+
> | p1         | 0002-01-01 00:00:00.0  | __HIVE_DEF |
> | p2         | 0002-01-01 00:00:00.0  | __HIVE_DEF |
> | p3         | 0003-01-01 00:00:00.0  | __HIVE_DEF |
> | p4         | 0004-01-01 00:00:00.0  | __HIVE_DEF |
> +------------+------------------------+------------+
> {code}
>  
> Is this a known issue? Is it fixed in any subsequent releases?
> Thanks & regards,
> Pawan Lawale



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692498#comment-16692498
 ] 

Hyukjin Kwon commented on SPARK-26019:
--

Please reopen if this is still being reproduced, or if there's any analysis of 
why this happens. Swapping the two lines should be fine, but I don't understand 
why or how it would help.

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems error is flaky - on next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692504#comment-16692504
 ] 

Liang-Chi Hsieh commented on SPARK-26019:
-

Doesn't the server begin handling requests only after {{serve_forever}} is 
called? And that is called after {{AccumulatorServer}} initialization.
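
For reference, a trimmed sketch of the startup sequence in 
pyspark/accumulators.py (reduced to the ordering being discussed; 
AccumulatorServer and _UpdateRequestHandler are defined earlier in the same 
module):

{code:python}
import threading

def _start_update_server(auth_token):
    # The server object, including its auth_token, is fully constructed first ...
    server = AccumulatorServer(("localhost", 0), _UpdateRequestHandler, auth_token)
    # ... and only afterwards does a daemon thread call serve_forever(),
    # which is when incoming connections start being handled.
    thread = threading.Thread(target=server.serve_forever)
    thread.daemon = True
    thread.start()
    return server
{code}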

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems error is flaky - on next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


