[jira] [Updated] (SPARK-26113) TypeError: object of type 'NoneType' has no len() in authenticate_and_accum_updates of pyspark/accumulators.py
[ https://issues.apache.org/jira/browse/SPARK-26113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sai Varun Reddy Daram updated SPARK-26113: -- Priority: Critical (was: Major)
> TypeError: object of type 'NoneType' has no len() in authenticate_and_accum_updates of pyspark/accumulators.py
> --
> Key: SPARK-26113
> URL: https://issues.apache.org/jira/browse/SPARK-26113
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes, PySpark
> Affects Versions: 2.4.0
> Reporter: Sai Varun Reddy Daram
> Priority: Critical
>
> Machine OS: Ubuntu 16.04.
> Kubernetes: Minikube
> Kubernetes Version: 1.10.0
> Spark Kubernetes Image: pyspark (on Docker Hub: saivarunr/spark-py:2.4), built using the standard Spark docker build.sh file.
> Steps to reproduce:
> 1) Create a Spark session:
> {code:java}
> spark_session=SparkSession.builder.master('k8s://https://192.168.99.100:8443').config('spark.executor.instances','1').config('spark.kubernetes.container.image','saivarunr/spark-py:2.4').getOrCreate()
> {code}
> 2) Create a sample DataFrame:
> {code:java}
> df=spark_session.createDataFrame([{'a':1}])
> {code}
> 3) Run an operation on this DataFrame:
> {code:java}
> df.count(){code}
> I get this output:
> {code:java}
> Exception happened during processing of request from ('127.0.0.1', 38690)
> Traceback (most recent call last):
> File "/usr/lib/python3.6/socketserver.py", line 317, in _handle_request_noblock
> self.process_request(request, client_address)
> File "/usr/lib/python3.6/socketserver.py", line 348, in process_request
> self.finish_request(request, client_address)
> File "/usr/lib/python3.6/socketserver.py", line 361, in finish_request
> self.RequestHandlerClass(request, client_address, self)
> File "/usr/lib/python3.6/socketserver.py", line 721, in __init__
> self.handle()
> File "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", line 266, in handle
> poll(authenticate_and_accum_updates)
> File "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", line 241, in poll
> if func():
> File "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", line 254, in authenticate_and_accum_updates
> received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
> {code}
> 4) Repeat the above step; it won't show the error.
>
> But now close the session, kill the Python terminal or process, and try again: the same happens.
>
> Something related to https://issues.apache.org/jira/browse/SPARK-26019 ? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
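For context, here is a minimal, simplified sketch (plain Python, not the actual Spark source; the function and variable names are borrowed from the traceback above) of the failure mode and the obvious guard: the handler calls len(auth_token) before checking whether an auth token was configured at all, and auth_token can be None.

```python
import io

# Simplified sketch only -- NOT the real pyspark/accumulators.py code.
# It mirrors the crash: len(auth_token) raises TypeError when auth_token is None.

def authenticate_and_accum_updates(rfile, auth_token):
    """Return True if the peer's token matches (or if no token is configured)."""
    if auth_token is None:
        # Guard absent in the 2.4.0 handler: without it, len(None) raises
        # "TypeError: object of type 'NoneType' has no len()"
        return True
    received = rfile.read(len(auth_token))
    if isinstance(received, bytes):
        received = received.decode("utf-8")
    return received == auth_token

# The socket's rfile is emulated with an in-memory buffer:
assert authenticate_and_accum_updates(io.BytesIO(b"secret"), "secret")
assert not authenticate_and_accum_updates(io.BytesIO(b"wrong!"), "secret")
assert authenticate_and_accum_updates(io.BytesIO(b""), None)  # no crash
```

The guard shown is one plausible shape of a fix; the SPARK-26019 ticket referenced above concerns the same None-token code path.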
[jira] [Updated] (SPARK-26113) TypeError: object of type 'NoneType' has no len() in authenticate_and_accum_updates of pyspark/accumulators.py
[ https://issues.apache.org/jira/browse/SPARK-26113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sai Varun Reddy Daram updated SPARK-26113: -- Priority: Major (was: Critical)
[jira] [Commented] (SPARK-26113) TypeError: object of type 'NoneType' has no len() in authenticate_and_accum_updates of pyspark/accumulators.py
[ https://issues.apache.org/jira/browse/SPARK-26113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691381#comment-16691381 ] Sai Varun Reddy Daram commented on SPARK-26113: --- [~hyukjin.kwon] sorry for that, I did not know.
[jira] [Updated] (SPARK-26113) TypeError: object of type 'NoneType' has no len() in authenticate_and_accum_updates of pyspark/accumulators.py
[ https://issues.apache.org/jira/browse/SPARK-26113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sai Varun Reddy Daram updated SPARK-26113: -- Description:
Machine OS: Ubuntu 16.04.
Kubernetes: Minikube
Kubernetes Version: 1.10.0
Spark Kubernetes Image: pyspark (on Docker Hub: saivarunr/spark-py:2.4), built using the standard Spark docker build.sh file.
The driver runs inside a pod in the Kubernetes cluster.
Steps to reproduce:
1) Create a Spark session:
{code:java}
spark_session=SparkSession.builder.master('k8s://https://192.168.99.100:8443').config('spark.executor.instances','1').config('spark.kubernetes.container.image','saivarunr/spark-py:2.4').getOrCreate()
{code}
2) Create a sample DataFrame:
{code:java}
df=spark_session.createDataFrame([{'a':1}])
{code}
3) Run an operation on this DataFrame:
{code:java}
df.count(){code}
I get this output:
{code:java}
Exception happened during processing of request from ('127.0.0.1', 38690)
Traceback (most recent call last):
File "/usr/lib/python3.6/socketserver.py", line 317, in _handle_request_noblock
self.process_request(request, client_address)
File "/usr/lib/python3.6/socketserver.py", line 348, in process_request
self.finish_request(request, client_address)
File "/usr/lib/python3.6/socketserver.py", line 361, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/usr/lib/python3.6/socketserver.py", line 721, in __init__
self.handle()
File "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", line 266, in handle
poll(authenticate_and_accum_updates)
File "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", line 241, in poll
if func():
File "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", line 254, in authenticate_and_accum_updates
received_token = self.rfile.read(len(auth_token))
TypeError: object of type 'NoneType' has no len()
{code}
4) Repeat the above step; it won't show the error.
But now close the session, kill the Python terminal or process, and try again: the same happens.
Something related to https://issues.apache.org/jira/browse/SPARK-26019 ?
[jira] [Created] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions
Sergey Zhemzhitsky created SPARK-26114: -- Summary: Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions Key: SPARK-26114 URL: https://issues.apache.org/jira/browse/SPARK-26114 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0, 2.3.2, 2.2.2 Environment: Spark 3.0.0-SNAPSHOT (master branch) Scala 2.11 Yarn 2.7 Reporter: Sergey Zhemzhitsky
Trying to use _coalesce_ after shuffle-oriented transformations leads to OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X GB of Y GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead_. The error happens when trying to specify a pretty small number of partitions in the _coalesce_ call.
*How to reproduce?*
# Start spark-shell
{code:bash}
spark-shell \
  --num-executors=5 \
  --executor-cores=2 \
  --master=yarn \
  --deploy-mode=client \
  --conf spark.executor.memory=1g \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
{code}
Please note the _-Dio.netty.noUnsafe=true_ property. Preventing off-heap memory usage seems to be the only way to control the amount of memory used for shuffle data transfer for now.
Also note that the total number of cores allocated for the job is 5x2=10.
# Then generate some test data
{code:scala}
import org.apache.hadoop.io._
import org.apache.hadoop.io.compress._
import org.apache.commons.lang._
import org.apache.spark._
// generate 100M records of sample data
sc.makeRDD(1 to 1000, 1000)
  .flatMap(item => (1 to 10)
    .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> new Text(RandomStringUtils.randomAlphanumeric(1024))))
  .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec]))
{code}
# Run the sample job
{code:scala}
import org.apache.hadoop.io._
import org.apache.spark._
import org.apache.spark.storage._
val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
rdd
  .map(item => item._1.toString -> item._2.toString)
  .repartitionAndSortWithinPartitions(new HashPartitioner(1000))
  .coalesce(10, false)
  .count
{code}
Note that the number of partitions is equal to the total number of cores allocated to the job.
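The blow-up described above can be sketched with back-of-the-envelope arithmetic (plain Python, not Spark code; the record count and record size are assumptions taken from the reproduction's own figures, not measured values): coalesce(10, shuffle = false) layered on a 1000-partition shuffle makes each of the 10 tasks read, and buffer in its PartitionedPairBuffer/ExternalSorter, roughly 100 upstream partitions at once.

```python
# Back-of-the-envelope sketch of why coalesce-without-shuffle runs out of memory here.
# Assumptions (taken from the reproduction text above, not measured):
# ~100M records of ~1 KiB each, 1000 shuffle partitions, 1g executor heap.

upstream_partitions = 1000
coalesced_partitions = 10                     # coalesce(10, false)
per_task = upstream_partitions // coalesced_partitions
print(per_task)                               # each task folds 100 upstream partitions

records = 100_000_000
record_bytes = 1024
bytes_per_task = (records // upstream_partitions) * per_task * record_bytes
print(bytes_per_task / 2**30)                 # roughly 9.5 GiB buffered per task,
                                              # far above the 1 GiB executor heap
```

This is why the error disappears when the coalesced partition count is closer to the shuffle partition count: fewer upstream partitions are folded into each task's in-memory buffer.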
[jira] [Updated] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Zhemzhitsky updated SPARK-26114: --- Attachment: run1-noparams-dominator-tree.png
[jira] [Updated] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Zhemzhitsky updated SPARK-26114: --- Attachment: run1-noparams-dominator-tree-externalsorter.png
[jira] [Comment Edited] (SPARK-26113) TypeError: object of type 'NoneType' has no len() in authenticate_and_accum_updates of pyspark/accumulators.py
[ https://issues.apache.org/jira/browse/SPARK-26113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691330#comment-16691330 ] Sai Varun Reddy Daram edited comment on SPARK-26113 at 11/19/18 9:33 AM: - Something to help here: https://issues.apache.org/jira/browse/SPARK-26019 ? was (Author: saivarunvishal): Something to help here: https://issues.apache.org/jira/browse/SPARK-26113 ?
[jira] [Issue Comment Deleted] (SPARK-20236) Overwrite a partitioned data source table should only overwrite related partitions
[ https://issues.apache.org/jira/browse/SPARK-20236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikalai Surta updated SPARK-20236: -- Comment: was deleted (was: Could you please explain why it doesn't work as I expect in my case? I create a table with dynamic partitions, insert data into 2 partitions, then insert overwrite data for 1 partition. I expect to see 2 partitions as a result, but get only 1: the overwrite truncates the whole table. It doesn't seem to be the same case as [~deepanker] had, as I create the table with an explicit PARTITIONED BY clause.
spark.sparkContext.getConf().getAll()
... ('spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")', ''), ...
sqlContext.sql("CREATE TABLE `debug2` () PARTITIONED BY (inputDate)")
sqlContext.sql("insert into debug2 select * from debug1")
sqlContext.sql("select * from debug2").select('ean', 'inputDate').show(10, False)
+-------------+---------+
|ean          |inputDate|
+-------------+---------+
|4019238159363|20181025 |
|3188642344151|20181026 |
+-------------+---------+
sqlContext.sql("insert overwrite table debug2 select * from debug1 where inputDate=='20181025'")
+-------------+---------+
|ean          |inputDate|
+-------------+---------+
|4019238159363|20181025 |
+-------------+---------+
The 20181026 record is lost.)
> Overwrite a partitioned data source table should only overwrite related partitions
> --
> Key: SPARK-20236
> URL: https://issues.apache.org/jira/browse/SPARK-20236
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Wenchen Fan
> Assignee: Wenchen Fan
> Priority: Major
> Labels: releasenotes
> Fix For: 2.3.0
>
> When we overwrite a partitioned data source table, currently Spark will truncate the entire table to write new data, or truncate a bunch of partitions according to the given static partitions.
> For example, {{INSERT OVERWRITE tbl ...}} will truncate the entire table, and {{INSERT OVERWRITE tbl PARTITION (a=1, b)}} will truncate all the partitions that start with {{a=1}}.
> This behavior is reasonable in that we can know which partitions will be overwritten before runtime. However, Hive behaves differently: it only overwrites related partitions, e.g. {{INSERT OVERWRITE tbl SELECT 1,2,3}} will only overwrite partition {{a=2, b=3}}, assuming {{tbl}} has only one data column and is partitioned by {{a}} and {{b}}.
> It seems better if we can follow Hive's behavior.
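One note on the deleted comment above: its config dump shows the literal string 'spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")' used as a configuration *key*, which suggests spark.sql.sources.partitionOverwriteMode was never actually set, leaving the default "static" mode that truncates every matching partition. The difference between the two modes can be illustrated with a small pure-Python model (not Spark code; the dict-of-partitions layout is an assumption for illustration only):

```python
# Pure-Python model of the two INSERT OVERWRITE modes described in the issue.
# A table is modeled as {partition_value: rows}; this is an illustration,
# not how Spark stores partitions.

def insert_overwrite(table, new_rows, mode):
    if mode == "static":
        # Default: all matching partitions (here, the whole table) are truncated
        table = {}
    elif mode == "dynamic":
        # Only partitions present in the incoming data are replaced
        table = dict(table)
    else:
        raise ValueError(f"unknown mode: {mode}")
    table.update(new_rows)
    return table

table = {"20181025": ["4019238159363"], "20181026": ["3188642344151"]}
overwrite = {"20181025": ["4019238159363"]}

# Static mode reproduces the data loss from the deleted comment:
assert insert_overwrite(table, overwrite, "static") == {"20181025": ["4019238159363"]}
# Dynamic mode keeps the untouched 20181026 partition:
assert insert_overwrite(table, overwrite, "dynamic") == {
    "20181025": ["4019238159363"],
    "20181026": ["3188642344151"],
}
```

In real Spark 2.3+, the intended call would be spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic") — key and value as two separate arguments — before running the INSERT OVERWRITE.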
[jira] [Commented] (SPARK-20236) Overwrite a partitioned data source table should only overwrite related partitions
[ https://issues.apache.org/jira/browse/SPARK-20236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691472#comment-16691472 ] Mikalai Surta commented on SPARK-20236: --- [~cloud_fan] It was my fault, works fine, I deleted my comment as invalid
[jira] [Updated] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Zhemzhitsky updated SPARK-26114: --- Attachment: run1-noparams-dominator-tree-externalsorter-gc-root.png
[jira] [Updated] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Zhemzhitsky updated SPARK-26114: --- Description: Trying to use _coalesce_ after shuffle-oriented transformations leads to OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X GB of Y GB physical memory used. Consider boostingspark.yarn.executor.memoryOverhead_. Discussion is [here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html]. The error happens when trying specify pretty small number of partitions in _coalesce_ call. *How to reproduce?* # Start spark-shell {code:bash} spark-shell \ --num-executors=5 \ --executor-cores=2 \ --master=yarn \ --deploy-mode=client \ --conf spark.executor.memory=1g \ --conf spark.dynamicAllocation.enabled=false \ --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true' {code} Please note using _-Dio.netty.noUnsafe=true_ property. Preventing off-heap memory usage seems to be the only way to control the amount of memory used for shuffle data transferring by now. 
Also note that the total number of cores allocated for job is 5x2=10 # Then generate some test data {code:scala} import org.apache.hadoop.io._ import org.apache.hadoop.io.compress._ import org.apache.commons.lang._ import org.apache.spark._ // generate 100M records of sample data sc.makeRDD(1 to 1000, 1000) .flatMap(item => (1 to 10) .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> new Text(RandomStringUtils.randomAlphanumeric(1024 .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) {code} # Run the sample job {code:scala} import org.apache.hadoop.io._ import org.apache.spark._ import org.apache.spark.storage._ val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text]) rdd .map(item => item._1.toString -> item._2.toString) .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) .coalesce(10,false) .count {code} Note that the number of partitions is equal to the total number of cores allocated to the job. was: Trying to use _coalesce_ after shuffle-oriented transformations leads to OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X GB of Y GB physical memory used. Consider boostingspark.yarn.executor.memoryOverhead_ The error happens when trying specify pretty small number of partitions in _coalesce_ call. *How to reproduce?* # Start spark-shell {code:bash} spark-shell \ --num-executors=5 \ --executor-cores=2 \ --master=yarn \ --deploy-mode=client \ --conf spark.executor.memory=1g \ --conf spark.dynamicAllocation.enabled=false \ --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true' {code} Please note using _-Dio.netty.noUnsafe=true_ property. Preventing off-heap memory usage seems to be the only way to control the amount of memory used for shuffle data transferring by now. 
Also note that the total number of cores allocated for job is 5x2=10 # Then generate some test data {code:scala} import org.apache.hadoop.io._ import org.apache.hadoop.io.compress._ import org.apache.commons.lang._ import org.apache.spark._ // generate 100M records of sample data sc.makeRDD(1 to 1000, 1000) .flatMap(item => (1 to 10) .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> new Text(RandomStringUtils.randomAlphanumeric(1024 .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) {code} # Run the sample job {code:scala} import org.apache.hadoop.io._ import org.apache.spark._ import org.apache.spark.storage._ val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text]) rdd .map(item => item._1.toString -> item._2.toString) .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) .coalesce(10,false) .count {code} Note that the number of partitions is equal to the total number of cores allocated to the job. > Memory leak of PartitionedPairBuffer when coalescing after > repartitionAndSortWithinPartitions > - > > Key: SPARK-26114 > URL: https://issues.apache.org/jira/browse/SPARK-26114 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.2, 2.4.0 > Environment: Spark 3.0.0-SNAPSHOT (master branch) > Scala 2.11 > Yarn 2.7 >Reporter: Sergey Zhemzhitsky >Priority: Major > Attachments: run1-noparams-dominator-tree-externalsorter-gc-root.png, > run1-noparams-dominator-tree-externalsorter.png, > run1-noparams-dominator-tree.png > > > Trying to use _c
[jira] [Updated] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Zhemzhitsky updated SPARK-26114: --- Description: Trying to use _coalesce_ after shuffle-oriented transformations leads to OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X GB of Y GB physical memory used. Consider boostingspark.yarn.executor.memoryOverhead_. Discussion is [here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html]. The error happens when trying specify pretty small number of partitions in _coalesce_ call. *How to reproduce?* # Start spark-shell {code:bash} spark-shell \ --num-executors=5 \ --executor-cores=2 \ --master=yarn \ --deploy-mode=client \ --conf spark.executor.memory=1g \ --conf spark.dynamicAllocation.enabled=false \ --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true' {code} Please note using _-Dio.netty.noUnsafe=true_ property. Preventing off-heap memory usage seems to be the only way to control the amount of memory used for shuffle data transferring by now. 
Also note that the total number of cores allocated for job is 5x2=10 # Then generate some test data {code:scala} import org.apache.hadoop.io._ import org.apache.hadoop.io.compress._ import org.apache.commons.lang._ import org.apache.spark._ // generate 100M records of sample data sc.makeRDD(1 to 1000, 1000) .flatMap(item => (1 to 10) .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> new Text(RandomStringUtils.randomAlphanumeric(1024 .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) {code} # Run the sample job {code:scala} import org.apache.hadoop.io._ import org.apache.spark._ import org.apache.spark.storage._ val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text]) rdd .map(item => item._1.toString -> item._2.toString) .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) .coalesce(10,false) .count {code} Note that the number of partitions is equal to the total number of cores allocated to the job. Here is dominator tree from the heapdump !run1-noparams-dominator-tree.png|width=500! was: Trying to use _coalesce_ after shuffle-oriented transformations leads to OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X GB of Y GB physical memory used. Consider boostingspark.yarn.executor.memoryOverhead_. Discussion is [here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html]. The error happens when trying specify pretty small number of partitions in _coalesce_ call. *How to reproduce?* # Start spark-shell {code:bash} spark-shell \ --num-executors=5 \ --executor-cores=2 \ --master=yarn \ --deploy-mode=client \ --conf spark.executor.memory=1g \ --conf spark.dynamicAllocation.enabled=false \ --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true' {code} Please note using _-Dio.netty.noUnsafe=true_ property. 
Preventing off-heap memory usage seems to be the only way to control the amount of memory used for shuffle data transferring by now. Also note that the total number of cores allocated for job is 5x2=10 # Then generate some test data {code:scala} import org.apache.hadoop.io._ import org.apache.hadoop.io.compress._ import org.apache.commons.lang._ import org.apache.spark._ // generate 100M records of sample data sc.makeRDD(1 to 1000, 1000) .flatMap(item => (1 to 10) .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> new Text(RandomStringUtils.randomAlphanumeric(1024 .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) {code} # Run the sample job {code:scala} import org.apache.hadoop.io._ import org.apache.spark._ import org.apache.spark.storage._ val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text]) rdd .map(item => item._1.toString -> item._2.toString) .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) .coalesce(10,false) .count {code} Note that the number of partitions is equal to the total number of cores allocated to the job. > Memory leak of PartitionedPairBuffer when coalescing after > repartitionAndSortWithinPartitions > - > > Key: SPARK-26114 > URL: https://issues.apache.org/jira/browse/SPARK-26114 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.2, 2.4.0 > Environment: Spark 3.0.0-SNAPSHOT (master branch) > Scala 2.11 > Yarn 2.7 >Reporter: Sergey Zhemzhitsky >Prior
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691462#comment-16691462 ] Sai Varun Reddy Daram commented on SPARK-26019: --- Any help with https://issues.apache.org/jira/browse/SPARK-26113 > pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" > in authenticate_and_accum_updates() > > > Key: SPARK-26019 > URL: https://issues.apache.org/jira/browse/SPARK-26019 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2, 2.4.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > Started happening after 2.3.1 -> 2.3.2 upgrade. > > {code:python} > Exception happened during processing of request from ('127.0.0.1', 43418) > > Traceback (most recent call last): > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 290, in _handle_request_noblock > self.process_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 318, in process_request > self.finish_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 331, in finish_request > self.RequestHandlerClass(request, client_address, self) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 652, in __init__ > self.handle() > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 263, in handle > poll(authenticate_and_accum_updates) > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 238, in poll > if func(): > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 251, in authenticate_and_accum_updates > received_token = self.rfile.read(len(auth_token)) > TypeError: object 
of type 'NoneType' has no len() > > {code} > > The error happens here: > https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254 > The PySpark code was just running a simple pipeline of > binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. ) > and then converting it to a DataFrame and running a count on it. > It seems the error is flaky: on the next rerun it didn't happen. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
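The failing line reads `len(auth_token)`, and `auth_token` can evidently be `None` at that point (e.g. when no authentication secret reached the handler). The failure mode itself can be reproduced standalone with a simplified, hypothetical stand-in for `authenticate_and_accum_updates` (not the actual pyspark code, no Spark required):

```python
import io

def read_token(rfile, auth_token):
    # Mirrors the failing pattern: read exactly len(auth_token) bytes back.
    # If auth_token is None, len() raises the TypeError seen in the traceback.
    received = rfile.read(len(auth_token))
    return received == auth_token

# Normal case: the expected token round-trips.
assert read_token(io.BytesIO(b"secret"), b"secret")

# Failure case: the token was never set.
try:
    read_token(io.BytesIO(b""), None)
except TypeError as exc:
    print(exc)  # object of type 'NoneType' has no len()
```

A `None` check before calling `len()` (or ensuring the secret is always propagated) would turn this into an explicit error instead of a raw TypeError in the socket handler.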
[jira] [Updated] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Zhemzhitsky updated SPARK-26114: --- Description: Trying to use _coalesce_ after shuffle-oriented transformations leads to OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X GB of Y GB physical memory used. Consider boostingspark.yarn.executor.memoryOverhead_. Discussion is [here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html]. The error happens when trying specify pretty small number of partitions in _coalesce_ call. *How to reproduce?* # Start spark-shell {code:bash} spark-shell \ --num-executors=5 \ --executor-cores=2 \ --master=yarn \ --deploy-mode=client \ --conf spark.executor.memory=1g \ --conf spark.dynamicAllocation.enabled=false \ --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true' {code} Please note using _-Dio.netty.noUnsafe=true_ property. Preventing off-heap memory usage seems to be the only way to control the amount of memory used for shuffle data transferring by now. 
Also note that the total number of cores allocated for job is 5x2=10 # Then generate some test data {code:scala} import org.apache.hadoop.io._ import org.apache.hadoop.io.compress._ import org.apache.commons.lang._ import org.apache.spark._ // generate 100M records of sample data sc.makeRDD(1 to 1000, 1000) .flatMap(item => (1 to 10) .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> new Text(RandomStringUtils.randomAlphanumeric(1024 .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) {code} # Run the sample job {code:scala} import org.apache.hadoop.io._ import org.apache.spark._ import org.apache.spark.storage._ val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text]) rdd .map(item => item._1.toString -> item._2.toString) .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) .coalesce(10,false) .count {code} Note that the number of partitions is equal to the total number of cores allocated to the job. Here is dominator tree from the heapdump !run1-noparams-dominator-tree.png|width=700! 4 instances of ExternalSorter, although there are only 2 concurrently running tasks per executor. !run1-noparams-dominator-tree-externalsorter.png|width=700! And paths to GC root of the already stopped ExternalSorter. !run1-noparams-dominator-tree-externalsorter-gc-root.png|width=700! was: Trying to use _coalesce_ after shuffle-oriented transformations leads to OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X GB of Y GB physical memory used. Consider boostingspark.yarn.executor.memoryOverhead_. Discussion is [here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html]. The error happens when trying specify pretty small number of partitions in _coalesce_ call. 
*How to reproduce?* # Start spark-shell {code:bash} spark-shell \ --num-executors=5 \ --executor-cores=2 \ --master=yarn \ --deploy-mode=client \ --conf spark.executor.memory=1g \ --conf spark.dynamicAllocation.enabled=false \ --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true' {code} Please note using _-Dio.netty.noUnsafe=true_ property. Preventing off-heap memory usage seems to be the only way to control the amount of memory used for shuffle data transferring by now. Also note that the total number of cores allocated for job is 5x2=10 # Then generate some test data {code:scala} import org.apache.hadoop.io._ import org.apache.hadoop.io.compress._ import org.apache.commons.lang._ import org.apache.spark._ // generate 100M records of sample data sc.makeRDD(1 to 1000, 1000) .flatMap(item => (1 to 10) .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> new Text(RandomStringUtils.randomAlphanumeric(1024 .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) {code} # Run the sample job {code:scala} import org.apache.hadoop.io._ import org.apache.spark._ import org.apache.spark.storage._ val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text]) rdd .map(item => item._1.toString -> item._2.toString) .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) .coalesce(10,false) .count {code} Note that the number of partitions is equal to the total number of cores allocated to the job. Here is dominator tree from the heapdump !run1-noparams-dominator-tree.png|width=500! > Memory leak of PartitionedPairBuffer when coalescing after > repartitionAndSortWithinPartitions > - > >
[jira] [Commented] (SPARK-25380) Generated plans occupy over 50% of Spark driver memory
[ https://issues.apache.org/jira/browse/SPARK-25380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691519#comment-16691519 ] Dave DeCaprio commented on SPARK-25380: --- I would like to comment that we are also seeing this. 200Mb plans are not unusual for us. > Generated plans occupy over 50% of Spark driver memory > -- > > Key: SPARK-25380 > URL: https://issues.apache.org/jira/browse/SPARK-25380 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Spark 2.3.1 (AWS emr-5.16.0) > >Reporter: Michael Spector >Priority: Minor > Attachments: Screen Shot 2018-09-06 at 23.19.56.png, Screen Shot > 2018-09-12 at 8.20.05.png, heapdump_OOM.png, image-2018-09-16-14-21-38-939.png > > > When debugging an OOM exception during long run of a Spark application (many > iterations of the same code) I've found that generated plans occupy most of > the driver memory. I'm not sure whether this is a memory leak or not, but it > would be helpful if old plans could be purged from memory anyways. > Attached are screenshots of OOM heap dump opened in JVisualVM. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
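Since the complaint is about the sheer size of retained plan strings (hundreds of megabytes), one mitigation direction is to cap how much of each plan text is kept. A rough standalone sketch of that idea (hypothetical helper, not taken from Spark's code):

```python
# Hypothetical illustration: truncate huge plan descriptions before retaining
# them, keeping only a bounded prefix plus a note about what was dropped.

def truncated_plan(plan_text, max_len=1000):
    """Return plan_text unchanged if small, else a bounded prefix with a marker."""
    if len(plan_text) <= max_len:
        return plan_text
    omitted = len(plan_text) - max_len
    return plan_text[:max_len] + f"... [{omitted} chars omitted]"

big_plan = "Project [a, b, c]\n" * 100_000  # ~1.8 MB of plan text
small = truncated_plan(big_plan)
assert len(small) < 1100  # bounded regardless of input size
```

Whether the plans here are merely large or actually leaked (never purged) is a separate question the heap dumps in the attachments would have to answer.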
[jira] [Resolved] (SPARK-26113) TypeError: object of type 'NoneType' has no len() in authenticate_and_accum_updates of pyspark/accumulators.py
[ https://issues.apache.org/jira/browse/SPARK-26113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sai Varun Reddy Daram resolved SPARK-26113. --- Resolution: Invalid > TypeError: object of type 'NoneType' has no len() in > authenticate_and_accum_updates of pyspark/accumulators.py > -- > > Key: SPARK-26113 > URL: https://issues.apache.org/jira/browse/SPARK-26113 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 2.4.0 >Reporter: Sai Varun Reddy Daram >Priority: Major > > Machine OS: Ubuntu 16.04. > Kubernetes: Minikube > Kubernetes Version: 1.10.0 > Spark Kubernetes Image: pyspark ( at docker hub: saivarunr/spark-py:2.4 ) > built using standard spark docker build.sh file. > Driver is inside pod in kubernetes cluster. > Steps to replicate: > 1) Create a spark Session: > {code:java} > // > spark_session=SparkSession.builder.master('k8s://https://192.168.99.100:8443').config('spark.executor.instances','1').config('spark.kubernetes.container.image','saivarunr/spark-py:2.4').getOrCreate() > {code} > 2) Create a sample DataFrame > {code:java} > // df=spark_session.createDataFrame([{'a':1}]) > {code} > 3) Do some operation on this dataframe > {code:java} > // df.count(){code} > I get this output. 
> {code:java} > // Exception happened during processing of request from ('127.0.0.1', 38690) > Traceback (most recent call last): > File "/usr/lib/python3.6/socketserver.py", line 317, in > _handle_request_noblock > self.process_request(request, client_address) > File "/usr/lib/python3.6/socketserver.py", line 348, in process_request > self.finish_request(request, client_address) > File "/usr/lib/python3.6/socketserver.py", line 361, in finish_request > self.RequestHandlerClass(request, client_address, self) > File "/usr/lib/python3.6/socketserver.py", line 721, in __init__ > self.handle() > File > "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", > line 266, in handle > poll(authenticate_and_accum_updates) > File > "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", > line 241, in poll > if func(): > File > "/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/accumulators.py", > line 254, in authenticate_and_accum_updates > received_token = self.rfile.read(len(auth_token)) > TypeError: object of type 'NoneType' has no len() > {code} > 4) Repeat the above step; it won't show the error. > > But now close the session, kill the Python terminal or process, and try > again: the same error happens. > > Something related to https://issues.apache.org/jira/browse/SPARK-26019 ? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26114: Assignee: Apache Spark > Memory leak of PartitionedPairBuffer when coalescing after > repartitionAndSortWithinPartitions > - > > Key: SPARK-26114 > URL: https://issues.apache.org/jira/browse/SPARK-26114 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.2, 2.4.0 > Environment: Spark 3.0.0-SNAPSHOT (master branch) > Scala 2.11 > Yarn 2.7 >Reporter: Sergey Zhemzhitsky >Assignee: Apache Spark >Priority: Major > Attachments: run1-noparams-dominator-tree-externalsorter-gc-root.png, > run1-noparams-dominator-tree-externalsorter.png, > run1-noparams-dominator-tree.png > > > Trying to use _coalesce_ after shuffle-oriented transformations leads to > OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X > GB of Y GB physical memory used. Consider > boostingspark.yarn.executor.memoryOverhead_. > Discussion is > [here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html]. > The error happens when trying specify pretty small number of partitions in > _coalesce_ call. > *How to reproduce?* > # Start spark-shell > {code:bash} > spark-shell \ > --num-executors=5 \ > --executor-cores=2 \ > --master=yarn \ > --deploy-mode=client \ > --conf spark.executor.memory=1g \ > --conf spark.dynamicAllocation.enabled=false \ > --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError > -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true' > {code} > Please note using _-Dio.netty.noUnsafe=true_ property. Preventing off-heap > memory usage seems to be the only way to control the amount of memory used > for shuffle data transferring by now. 
> Also note that the total number of cores allocated for the job is 5x2=10 > # Then generate some test data > {code:scala} > import org.apache.hadoop.io._ > import org.apache.hadoop.io.compress._ > import org.apache.commons.lang._ > import org.apache.spark._ > // generate 100M records of sample data > sc.makeRDD(1 to 1000, 1000) > .flatMap(item => (1 to 10) > .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) > -> new Text(RandomStringUtils.randomAlphanumeric(1024)))) > .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) > {code} > # Run the sample job > {code:scala} > import org.apache.hadoop.io._ > import org.apache.spark._ > import org.apache.spark.storage._ > val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text]) > rdd > .map(item => item._1.toString -> item._2.toString) > .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) > .coalesce(10, false) > .count > {code} > Note that the number of partitions is equal to the total number of cores > allocated to the job. > Here is the dominator tree from the heap dump > !run1-noparams-dominator-tree.png|width=700! > There are 4 instances of ExternalSorter, although there are only 2 concurrently running > tasks per executor. > !run1-noparams-dominator-tree-externalsorter.png|width=700! > And paths to the GC root of the already-stopped ExternalSorter. > !run1-noparams-dominator-tree-externalsorter-gc-root.png|width=700! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691575#comment-16691575 ] Apache Spark commented on SPARK-26114: -- User 'szhem' has created a pull request for this issue: https://github.com/apache/spark/pull/23083 > Memory leak of PartitionedPairBuffer when coalescing after > repartitionAndSortWithinPartitions > - > > Key: SPARK-26114 > URL: https://issues.apache.org/jira/browse/SPARK-26114 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.2, 2.4.0 > Environment: Spark 3.0.0-SNAPSHOT (master branch) > Scala 2.11 > Yarn 2.7 >Reporter: Sergey Zhemzhitsky >Priority: Major > Attachments: run1-noparams-dominator-tree-externalsorter-gc-root.png, > run1-noparams-dominator-tree-externalsorter.png, > run1-noparams-dominator-tree.png > > > Trying to use _coalesce_ after shuffle-oriented transformations leads to > OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X > GB of Y GB physical memory used. Consider > boostingspark.yarn.executor.memoryOverhead_. > Discussion is > [here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html]. > The error happens when trying specify pretty small number of partitions in > _coalesce_ call. > *How to reproduce?* > # Start spark-shell > {code:bash} > spark-shell \ > --num-executors=5 \ > --executor-cores=2 \ > --master=yarn \ > --deploy-mode=client \ > --conf spark.executor.memory=1g \ > --conf spark.dynamicAllocation.enabled=false \ > --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError > -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true' > {code} > Please note using _-Dio.netty.noUnsafe=true_ property. Preventing off-heap > memory usage seems to be the only way to control the amount of memory used > for shuffle data transferring by now. 
> Also note that the total number of cores allocated for the job is 5x2=10 > # Then generate some test data > {code:scala} > import org.apache.hadoop.io._ > import org.apache.hadoop.io.compress._ > import org.apache.commons.lang._ > import org.apache.spark._ > // generate 100M records of sample data > sc.makeRDD(1 to 1000, 1000) > .flatMap(item => (1 to 10) > .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) > -> new Text(RandomStringUtils.randomAlphanumeric(1024)))) > .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) > {code} > # Run the sample job > {code:scala} > import org.apache.hadoop.io._ > import org.apache.spark._ > import org.apache.spark.storage._ > val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text]) > rdd > .map(item => item._1.toString -> item._2.toString) > .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) > .coalesce(10, false) > .count > {code} > Note that the number of partitions is equal to the total number of cores > allocated to the job. > Here is the dominator tree from the heap dump > !run1-noparams-dominator-tree.png|width=700! > There are 4 instances of ExternalSorter, although there are only 2 concurrently running > tasks per executor. > !run1-noparams-dominator-tree-externalsorter.png|width=700! > And paths to the GC root of the already-stopped ExternalSorter. > !run1-noparams-dominator-tree-externalsorter-gc-root.png|width=700! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26114: Assignee: (was: Apache Spark) > Memory leak of PartitionedPairBuffer when coalescing after > repartitionAndSortWithinPartitions > - > > Key: SPARK-26114 > URL: https://issues.apache.org/jira/browse/SPARK-26114 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.2, 2.4.0 > Environment: Spark 3.0.0-SNAPSHOT (master branch) > Scala 2.11 > Yarn 2.7 >Reporter: Sergey Zhemzhitsky >Priority: Major > Attachments: run1-noparams-dominator-tree-externalsorter-gc-root.png, > run1-noparams-dominator-tree-externalsorter.png, > run1-noparams-dominator-tree.png > > > Trying to use _coalesce_ after shuffle-oriented transformations leads to > OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X > GB of Y GB physical memory used. Consider > boostingspark.yarn.executor.memoryOverhead_. > Discussion is > [here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html]. > The error happens when trying specify pretty small number of partitions in > _coalesce_ call. > *How to reproduce?* > # Start spark-shell > {code:bash} > spark-shell \ > --num-executors=5 \ > --executor-cores=2 \ > --master=yarn \ > --deploy-mode=client \ > --conf spark.executor.memory=1g \ > --conf spark.dynamicAllocation.enabled=false \ > --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError > -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true' > {code} > Please note using _-Dio.netty.noUnsafe=true_ property. Preventing off-heap > memory usage seems to be the only way to control the amount of memory used > for shuffle data transferring by now. 
> Also note that the total number of cores allocated for the job is 5x2=10 > # Then generate some test data > {code:scala} > import org.apache.hadoop.io._ > import org.apache.hadoop.io.compress._ > import org.apache.commons.lang._ > import org.apache.spark._ > // generate 100M records of sample data > sc.makeRDD(1 to 1000, 1000) > .flatMap(item => (1 to 10) > .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) > -> new Text(RandomStringUtils.randomAlphanumeric(1024)))) > .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) > {code} > # Run the sample job > {code:scala} > import org.apache.hadoop.io._ > import org.apache.spark._ > import org.apache.spark.storage._ > val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text]) > rdd > .map(item => item._1.toString -> item._2.toString) > .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) > .coalesce(10, false) > .count > {code} > Note that the number of partitions is equal to the total number of cores > allocated to the job. > Here is the dominator tree from the heap dump > !run1-noparams-dominator-tree.png|width=700! > There are 4 instances of ExternalSorter, although there are only 2 concurrently running > tasks per executor. > !run1-noparams-dominator-tree-externalsorter.png|width=700! > And paths to the GC root of the already-stopped ExternalSorter. > !run1-noparams-dominator-tree-externalsorter-gc-root.png|width=700! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26115) Fix deprecated warnings related to scala 2.12 in core module
Gengliang Wang created SPARK-26115: -- Summary: Fix deprecated warnings related to scala 2.12 in core module Key: SPARK-26115 URL: https://issues.apache.org/jira/browse/SPARK-26115 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Gengliang Wang {quote}[warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:505: Eta-expansion of zero-argument methods is deprecated. To avoid this warning, write (() => SparkContext.this.reportHeartBeat()). [warn] _heartbeater = new Heartbeater(env.memoryManager, reportHeartBeat, "driver-heartbeater", [warn] ^ [warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala:273: Eta-expansion of zero-argument methods is deprecated. To avoid this warning, write (() => HadoopDelegationTokenManager.this.fileSystemsToAccess()). [warn] val providers = Seq(new HadoopFSDelegationTokenProvider(fileSystemsToAccess)) ++ [warn] ^ [warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/executor/Executor.scala:193: Eta-expansion of zero-argument methods is deprecated. To avoid this warning, write (() => Executor.this.reportHeartBeat()). [warn] private val heartbeater = new Heartbeater(env.memoryManager, reportHeartBeat, [warn]^ [warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/api/r/RBackend.scala:102: method childGroup in class ServerBootstrap is deprecated: see corresponding Javadoc for more information. 
[warn] if (bootstrap != null && bootstrap.childGroup() != null) { [warn]^ [warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala:63: type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use java.util.concurrent.ForkJoinPool directly, instead of this alias [warn] new ForkJoinTaskSupport(new ForkJoinPool(8)) [warn] ^ [warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/scheduler/StageInfo.scala:59: value attemptId in class StageInfo is deprecated (since 2.3.0): Use attemptNumber instead [warn] def attemptNumber(): Int = attemptId [warn] ^ [warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:186: type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use java.util.concurrent.ForkJoinPool directly, instead of this alias [warn] def newForkJoinPool(prefix: String, maxThreadNumber: Int): SForkJoinPool = { [warn] ^ [warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:189: type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use java.util.concurrent.ForkJoinPool directly, instead of this alias [warn] override def newThread(pool: SForkJoinPool) = [warn]^ [warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:190: type ForkJoinWorkerThread in package forkjoin is deprecated (since 2.12.0): use java.util.concurrent.ForkJoinWorkerThread directly, instead of this alias [warn] new SForkJoinWorkerThread(pool) { [warn] ^ [warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:194: type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use java.util.concurrent.ForkJoinPool directly, instead of this alias [warn] new SForkJoinPool(maxThreadNumber, factory, [warn] ^ [warn] 10 warnings found [warn] 
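For the eta-expansion warnings above, the fix the compiler suggests is to pass an explicit function literal instead of relying on implicit eta-expansion of a zero-argument method. A minimal self-contained sketch follows; the Heartbeater here is a stand-in with a made-up signature, not Spark's actual class.

```scala
// Stand-in for a class that takes an explicit () => Unit callback.
class Heartbeater(report: () => Unit) {
  def beat(): Unit = report()
}

object EtaExpansionDemo {
  var beats = 0
  def reportHeartBeat(): Unit = { beats += 1 }

  def main(args: Array[String]): Unit = {
    // Deprecated in Scala 2.12:  new Heartbeater(reportHeartBeat)
    // Warning-free: spell out the function literal explicitly.
    val h = new Heartbeater(() => reportHeartBeat())
    h.beat()
    h.beat()
    println(beats)
  }
}
```

The ForkJoinPool warnings in the same list have an analogous mechanical fix: replace the deprecated scala.concurrent.forkjoin aliases with java.util.concurrent.ForkJoinPool and ForkJoinWorkerThread directly.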
[jira] [Resolved] (SPARK-26115) Fix deprecated warnings related to scala 2.12 in core module
[ https://issues.apache.org/jira/browse/SPARK-26115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-26115. Resolution: Duplicate Duplicated to SPARK-2609 > Fix deprecated warnings related to scala 2.12 in core module > > > Key: SPARK-26115 > URL: https://issues.apache.org/jira/browse/SPARK-26115 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Minor > > {quote}[warn] > /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:505: > Eta-expansion of zero-argument methods is deprecated. To avoid this warning, > write (() => SparkContext.this.reportHeartBeat()). > [warn] _heartbeater = new Heartbeater(env.memoryManager, reportHeartBeat, > "driver-heartbeater", > [warn] ^ > [warn] > /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala:273: > Eta-expansion of zero-argument methods is deprecated. To avoid this warning, > write (() => HadoopDelegationTokenManager.this.fileSystemsToAccess()). > [warn] val providers = Seq(new > HadoopFSDelegationTokenProvider(fileSystemsToAccess)) ++ > [warn] ^ > [warn] > /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/executor/Executor.scala:193: > Eta-expansion of zero-argument methods is deprecated. To avoid this warning, > write (() => Executor.this.reportHeartBeat()). > [warn] private val heartbeater = new Heartbeater(env.memoryManager, > reportHeartBeat, > [warn]^ > [warn] > /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/api/r/RBackend.scala:102: > method childGroup in class ServerBootstrap is deprecated: see corresponding > Javadoc for more information. 
> [warn] if (bootstrap != null && bootstrap.childGroup() != null) { > [warn]^ > [warn] > /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala:63: > type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use > java.util.concurrent.ForkJoinPool directly, instead of this alias > [warn] new ForkJoinTaskSupport(new ForkJoinPool(8)) > [warn] ^ > [warn] > /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/scheduler/StageInfo.scala:59: > value attemptId in class StageInfo is deprecated (since 2.3.0): Use > attemptNumber instead > [warn] def attemptNumber(): Int = attemptId > [warn] ^ > [warn] > /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:186: > type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use > java.util.concurrent.ForkJoinPool directly, instead of this alias > [warn] def newForkJoinPool(prefix: String, maxThreadNumber: Int): > SForkJoinPool = { > [warn] ^ > [warn] > /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:189: > type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use > java.util.concurrent.ForkJoinPool directly, instead of this alias > [warn] override def newThread(pool: SForkJoinPool) = > [warn]^ > [warn] > /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:190: > type ForkJoinWorkerThread in package forkjoin is deprecated (since 2.12.0): > use java.util.concurrent.ForkJoinWorkerThread directly, instead of this alias > [warn] new SForkJoinWorkerThread(pool) { > [warn] ^ > [warn] > /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:194: > type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use > java.util.concurrent.ForkJoinPool directly, instead of this alias > [warn] new SForkJoinPool(maxThreadNumber, factory, > [warn] ^ > 
[warn] 10 warnings found
[jira] [Created] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors
Pierre Lienhart created SPARK-26116: --- Summary: Spark SQL - Sort when writing partitioned parquet lead to OOM errors Key: SPARK-26116 URL: https://issues.apache.org/jira/browse/SPARK-26116 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1 Reporter: Pierre Lienhart When writing partitioned parquet using ```partitionBy```, it looks like Spark sorts each partition before writing, but this sort consumes a huge amount of memory compared to the size of the data. The executors can then go OOM and get killed by YARN. As a consequence, it also forces provisioning a huge amount of memory compared to the data to be written. Error messages found in the Spark UI are like the following: ``` Spark UI description of failure : Job aborted due to stage failure: Task 169 in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. 
``` ``` Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, executor 1): org.apache.spark.SparkException: Task failed while writing rows at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad (No such file or directory) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425) at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160) at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) ... 8 more ``` In the stderr logs, we can see that a huge amount of sort data (the partition being sorted here is 250 MB when persisted into memory, deserialized) is being spilled to disk (```INFO UnsafeExternalSorter: Thread 155 spilling sort data of 3.6 GB to disk```). Sometimes the data is spilled to disk in time and the sort completes (```INFO FileFormatWriter: Sorting complete. Writing out partition files one at a time.```) but sometimes it does not, and we see multiple ```TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.``` until the application finally runs OOM with logs such as ```ERROR UnsafeExternalSorter: Unable to grow the pointer array```. I should mention that when looking at individual (successful) write tasks in the Spark UI, the Peak Execution Memory metric is always 0.
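The sort exists because a ```partitionBy``` write produces one file per partition value, and sorting rows by the partition key lets the writer keep a single output file open at a time (hence the "Writing out partition files one at a time" log above). A toy model of that invariant, not Spark's actual FileFormatWriter; the names here are hypothetical.

```scala
// Toy model: after sorting by partition key, the writer opens each
// partition's file exactly once, in key order. Hypothetical sketch only.
object DynamicPartitionWriteSketch {
  // Returns the sequence of "file open" events for a sorted write.
  def fileOpens(rows: Seq[(String, Int)]): Seq[String] = {
    val sorted = rows.sortBy(_._1) // what the pre-write sort achieves
    sorted.foldLeft(Vector.empty[String]) { case (opens, (key, _)) =>
      if (opens.lastOption.contains(key)) opens else opens :+ key
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq("b" -> 1, "a" -> 2, "b" -> 3, "a" -> 4)
    println(fileOpens(rows)) // one writer per key, opened sequentially
  }
}
```

A commonly suggested mitigation (an assumption here, not something from this report) is to repartition or sort the DataFrame by the partition columns before the write, so each task's sorter sees less data.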
[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors
[ https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pierre Lienhart updated SPARK-26116: Description: When writing partitioned parquet using {{partitionBy}}, it looks like Spark sorts each partition before writing, but this sort consumes a huge amount of memory compared to the size of the data. The executors can then go OOM and get killed by YARN. As a consequence, it also forces provisioning a huge amount of memory compared to the data to be written. Error messages found in the Spark UI are like the following: {code:java} Spark UI description of failure : Job aborted due to stage failure: Task 169 in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. 
{code} {code:java} Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, executor 1): org.apache.spark.SparkException: Task failed while writing rows at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad (No such file or directory) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425) at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160) at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) ... 8 more{code} In the stderr logs, we can see that a huge amount of sort data (the partition being sorted here is 250 MB when persisted into memory, deserialized) is being spilled to disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data of 3.6 GB to disk}}). Sometimes the data is spilled to disk in time and the sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out partition files one at a time.}}) but sometimes it does not, and we see multiple {{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} until the application finally runs OOM with logs such as {{ERROR UnsafeExternalSorter: Unable to grow the pointer array}}. I should mention that when looking at individual (successful) write tasks in the Spark UI, the Peak Execution Memory metric is always 0. It looks like a known issue: SPARK-12546 is explicitly related a
[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors
[ https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pierre Lienhart updated SPARK-26116:
------------------------------------
    Description: 

When writing partitioned parquet using {{partitionBy}}, it looks like Spark sorts each partition before writing, but this sort consumes a huge amount of memory compared to the size of the data. The executors can then go OOM and get killed by YARN. As a consequence, it also forces us to provision a huge amount of memory compared to the data to be written.

Error messages found in the Spark UI are like the following:

{code:java}
Spark UI description of failure : Job aborted due to stage failure: Task 169 in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
{code}

{code:java}
Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, executor 1): org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad (No such file or directory)
	at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)
	at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)
	at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)
	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)
	at org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
	... 8 more
{code}

In the stderr logs, we can see that a huge amount of sort data (the partition being sorted here is 250 MB when persisted into memory, deserialized) is being spilled to disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data of 3.6 GB to disk}}). Sometimes the data is spilled to disk in time and the sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out partition files one at a time.}}), but sometimes it is not, and we see multiple {{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} messages until the application finally runs OOM with logs such as {{ERROR UnsafeExternalSorter: Unable to grow the pointer array}}. I should mention that when looking at individual (successful) write tasks in the Spark UI, the Peak Execution Memory metric is always 0. It looks like a known issue: SPARK-12546 is explicitly related a
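The YARN kill message above suggests boosting {{spark.yarn.executor.memoryOverhead}}. As a rough sketch of why an 8 GB executor dies at 8.1 GB of physical memory: in Spark 2.x on YARN, when the overhead is not set explicitly, it defaults (to my understanding) to 10 % of the executor memory with a 384 MB floor, so the container is only slightly larger than the heap and off-heap allocations like the sorter's pages quickly exhaust it. The helper name below is mine, not a Spark API:

{code:java}
# Sketch of the Spark 2.x on-YARN default for executor memory overhead
# (assumption: 10% of spark.executor.memory, with a 384 MB minimum).

MEMORY_OVERHEAD_FACTOR = 0.10
MEMORY_OVERHEAD_MIN_MB = 384

def default_memory_overhead_mb(executor_memory_mb: int) -> int:
    """Overhead YARN adds on top of the executor heap when
    spark.yarn.executor.memoryOverhead is left unset."""
    return max(int(executor_memory_mb * MEMORY_OVERHEAD_FACTOR),
               MEMORY_OVERHEAD_MIN_MB)

# For an 8 GB (8192 MB) executor: 10% is 819 MB, so the container is
# sized at roughly 8192 + 819 MB -- close to the "8.1 GB of 8 GB" limit
# reported by YARN in the error above.
print(default_memory_overhead_mb(8192))
{code}

Under these assumptions, any off-heap usage beyond that ~10 % slice gets the container killed, which is consistent with the log's advice to raise the overhead manually.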
[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet leads to OOM errors
[ https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pierre Lienhart updated SPARK-26116:
------------------------------------
    Summary: Spark SQL - Sort when writing partitioned parquet leads to OOM errors  (was: Spark SQL - Sort when writing partitioned parquet lead to OOM errors)

> Spark SQL - Sort when writing partitioned parquet leads to OOM errors
> ---------------------------------------------------------------------
>
>                 Key: SPARK-26116
>                 URL: https://issues.apache.org/jira/browse/SPARK-26116
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.1
>            Reporter: Pierre Lienhart
>            Priority: Major
>
> When writing partitioned parquet using {{partitionBy}}, it looks like Spark
> sorts each partition before writing, but this sort consumes a huge amount of
> memory compared to the size of the data. The executors can then go OOM and
> get killed by YARN. As a consequence, it also forces us to provision a huge
> amount of memory compared to the data to be written.
> Error messages found in the Spark UI are like the following:
> {code:java}
> Spark UI description of failure : Job aborted due to stage failure: Task 169
> in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage
> 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure
> (executor 1 exited caused by one of the running tasks) Reason: Container
> killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> {code}
>
> {code:java}
> Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most
> recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx,
> executor 1): org.apache.spark.SparkException: Task failed while writing rows
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:99)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.OutOfMemoryError: error while calling spill() on
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 :
> /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
> (No such file or directory)
> 	at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)
> 	at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)
> 	at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)
> 	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)
> 	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)
> 	at org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)
> 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
> 	... 8 more{code}
>
> In the stderr logs, we can see that a huge amount of sort data (the partition
> being sorted here is 250 MB when persisted into memory, deserialized) is
> being spilled to disk ({{INFO UnsafeExternalSo
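A mitigation consistent with the YARN message quoted above is to set the overhead explicitly instead of relying on the default. The values and job name below are illustrative placeholders, not settings taken from this ticket:

{code:java}
# Hedged sketch: give the dynamic-partition sort more off-heap headroom
# before YARN kills the container. 2048 MB is an arbitrary example value;
# your_job.py is a placeholder for the actual application.
spark-submit \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --conf spark.executor.memory=8g \
  your_job.py
{code}

Raising the overhead does not reduce the memory the sort needs; it only enlarges the container so the spill has a chance to complete before the limit is hit, which matches the intermittent success described in the report.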
[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet leads to OOM errors
[ https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pierre Lienhart updated SPARK-26116: Description: When writing partitioned parquet using {{partitionBy}}, it looks like Spark sorts each partition before writing, but this sort consumes a huge amount of memory compared to the size of the data. The executors can then go OOM and get killed by YARN. As a consequence, it also forces us to provision a huge amount of memory compared to the data to be written. Error messages found in the Spark UI are like the following: {code:java} Spark UI description of failure : Job aborted due to stage failure: Task 169 in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. 
{code} {code:java} Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, executor 1): org.apache.spark.SparkException: Task failed while writing rows at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad (No such file or directory) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425) at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160) at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) ... 8 more{code} In the stderr logs, we can see that a huge amount of sort data (the partition being sorted here is 250 MB when persisted into memory, deserialized) is being spilled to disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data of 3.6 GB to disk}}). Sometimes the data is spilled to disk in time and the sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out partition files one at a time.}}), but sometimes it does not and we see multiple {{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} messages until the application finally runs OOM with logs such as {{ERROR UnsafeExternalSorter: Unable to grow the pointer array}}. I should mention that when looking at individual (successful) write tasks in the Spark UI, the Peak Execution Memory metric is always 0. It looks like a known issue: SPARK-12546 is explicitly related.
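The sort-before-write behaviour described above can be captured in a minimal model (plain Python, not Spark's actual implementation): sorting the incoming rows by partition key is what lets the dynamic-partition writer keep a single output file open at a time, and it is exactly this sort that spills and can run out of memory.

```python
# Minimal model of a dynamic-partition write: sort rows by partition key,
# then write each partition's rows contiguously, rolling to a new "file"
# at every key boundary. The sorted() call stands in for the
# UnsafeExternalSorter step that spills and can OOM.

def write_partitioned(rows, key_fn):
    files = {}                 # partition key -> rows "written" to that file
    current_key = object()     # sentinel: no file open yet
    for row in sorted(rows, key=key_fn):   # the memory-hungry sort
        k = key_fn(row)
        if k != current_key:               # partition boundary: open next file
            current_key = k
            files[k] = []
        files[k].append(row)
    return files

rows = [("2018-11-19", 1), ("2018-11-18", 2), ("2018-11-19", 3)]
out = write_partitioned(rows, key_fn=lambda r: r[0])
# each partition "file" receives its rows in one contiguous run
```

Because of the sort, only one writer needs to be open at a time; the trade-off reported here is that the sort itself dominates memory usage.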
[jira] [Updated] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Zhemzhitsky updated SPARK-26114: --- Description: Trying to use _coalesce_ after shuffle-oriented transformations leads to OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X GB of Y GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead_. Discussion is [here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html]. The error happens when trying to specify a pretty small number of partitions in the _coalesce_ call. *How to reproduce?* # Start spark-shell {code:bash} spark-shell \ --num-executors=5 \ --executor-cores=2 \ --master=yarn \ --deploy-mode=client \ --conf spark.executor.memoryOverhead=512 \ --conf spark.executor.memory=1g \ --conf spark.dynamicAllocation.enabled=false \ --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true' {code} Please note the use of the _-Dio.netty.noUnsafe=true_ property. Preventing off-heap memory usage currently seems to be the only way to control the amount of memory used for transferring shuffle data. 
Also note that the total number of cores allocated for the job is 5x2=10 # Then generate some test data {code:scala} import org.apache.hadoop.io._ import org.apache.hadoop.io.compress._ import org.apache.commons.lang._ import org.apache.spark._ // generate 100M records of sample data sc.makeRDD(1 to 1000, 1000) .flatMap(item => (1 to 10) .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> new Text(RandomStringUtils.randomAlphanumeric(1024)))) .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) {code} # Run the sample job {code:scala} import org.apache.hadoop.io._ import org.apache.spark._ import org.apache.spark.storage._ val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text]) rdd .map(item => item._1.toString -> item._2.toString) .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) .coalesce(10, false) .count {code} Note that the number of partitions is equal to the total number of cores allocated to the job. Here is the dominator tree from the heap dump !run1-noparams-dominator-tree.png|width=700! 4 instances of ExternalSorter, although there are only 2 concurrently running tasks per executor. !run1-noparams-dominator-tree-externalsorter.png|width=700! And the paths to GC roots of an already stopped ExternalSorter. !run1-noparams-dominator-tree-externalsorter-gc-root.png|width=700! was: Trying to use _coalesce_ after shuffle-oriented transformations leads to OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X GB of Y GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead_. Discussion is [here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html]. The error happens when trying to specify a pretty small number of partitions in the _coalesce_ call. 
*How to reproduce?* # Start spark-shell {code:bash} spark-shell \ --num-executors=5 \ --executor-cores=2 \ --master=yarn \ --deploy-mode=client \ --conf spark.executor.memory=1g \ --conf spark.dynamicAllocation.enabled=false \ --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true' {code} Please note the use of the _-Dio.netty.noUnsafe=true_ property. Preventing off-heap memory usage currently seems to be the only way to control the amount of memory used for transferring shuffle data. Also note that the total number of cores allocated for the job is 5x2=10 # Then generate some test data {code:scala} import org.apache.hadoop.io._ import org.apache.hadoop.io.compress._ import org.apache.commons.lang._ import org.apache.spark._ // generate 100M records of sample data sc.makeRDD(1 to 1000, 1000) .flatMap(item => (1 to 10) .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> new Text(RandomStringUtils.randomAlphanumeric(1024)))) .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) {code} # Run the sample job {code:scala} import org.apache.hadoop.io._ import org.apache.spark._ import org.apache.spark.storage._ val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text]) rdd .map(item => item._1.toString -> item._2.toString) .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) .coalesce(10, false) .count {code} Note that the number of partitions is equal to the total number of cores allocated to the job. Here is the dominator tree from the heap dump !run1-noparams-dominator-tree.png|width=700! 4 instances of ExternalSorter, although there are only 2 concurrently running tasks per executor. !run1-noparams-dominator-tree-externalsorter.png|width=700!
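To see why coalescing to the number of cores concentrates sorter state in so few tasks, the partition grouping performed by a shuffle-free coalesce can be sketched as follows (an illustrative model only; Spark's actual CoalescedRDD uses a locality-aware grouping):

```python
# Illustrative sketch: coalesce(n) without a shuffle makes each of the n
# tasks consume ~num_parents/n upstream partitions in sequence. Any
# per-parent-partition state (such as an ExternalSorter from
# repartitionAndSortWithinPartitions) that is not released promptly
# therefore accumulates inside a single task.

def coalesce_groups(num_parents, n):
    """Evenly assign parent partition indices to n coalesced partitions."""
    groups = [[] for _ in range(n)]
    for p in range(num_parents):
        groups[p * n // num_parents].append(p)
    return groups

groups = coalesce_groups(1000, 10)
# with 1000 sorted parent partitions coalesced to 10 tasks, every task
# iterates 100 parents' worth of sorter output back to back
```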
[jira] [Created] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception
caoxuewen created SPARK-26117: - Summary: use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception Key: SPARK-26117 URL: https://issues.apache.org/jira/browse/SPARK-26117 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 2.5.0 Reporter: caoxuewen PR #20014 introduced SparkOutOfMemoryError to avoid killing the entire executor when an OutOfMemoryError is thrown. So when memory is requested via MemoryConsumer.allocatePage and the allocation fails, the exception caught should be a SparkOutOfMemoryError instead of an OutOfMemoryError. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
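The intent behind the proposal can be illustrated with a small model (Python, with names merely mirroring the Scala classes; the real SparkOutOfMemoryError extends java.lang.OutOfMemoryError so existing handlers still match): a task-scoped error type lets the task runner fail just the offending task instead of treating the failure as executor-fatal.

```python
# Hypothetical sketch of the distinction the issue asks for: a dedicated
# subclass marks "this task's memory request failed", which the task
# runner can catch per task, as opposed to a process-wide OutOfMemoryError
# that would take down the whole executor.

class SparkOutOfMemoryError(MemoryError):
    """Task-scoped allocation failure, safe to catch per task."""

def allocate_page(available, requested):
    if requested > available:
        raise SparkOutOfMemoryError(f"Unable to acquire {requested} bytes")
    return requested

def run_task(available, requested):
    try:
        allocate_page(available, requested)
        return "succeeded"
    except SparkOutOfMemoryError:
        return "task failed"   # the executor survives; the task is retried
```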
[jira] [Assigned] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception
[ https://issues.apache.org/jira/browse/SPARK-26117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26117: Assignee: Apache Spark > use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception > -- > > Key: SPARK-26117 > URL: https://issues.apache.org/jira/browse/SPARK-26117 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.5.0 >Reporter: caoxuewen >Assignee: Apache Spark >Priority: Major > > the pr #20014 which introduced SparkOutOfMemoryError to avoid killing the > entire executor when an OutOfMemoryError is thrown. > so apply for memory using MemoryConsumer. allocatePage when catch exception, > use SparkOutOfMemoryError instead of OutOfMemoryError -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception
[ https://issues.apache.org/jira/browse/SPARK-26117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26117: Assignee: (was: Apache Spark) > use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception > -- > > Key: SPARK-26117 > URL: https://issues.apache.org/jira/browse/SPARK-26117 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.5.0 >Reporter: caoxuewen >Priority: Major > > the pr #20014 which introduced SparkOutOfMemoryError to avoid killing the > entire executor when an OutOfMemoryError is thrown. > so apply for memory using MemoryConsumer. allocatePage when catch exception, > use SparkOutOfMemoryError instead of OutOfMemoryError -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26026) Published Scaladoc jars missing from Maven Central
[ https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26026. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23069 [https://github.com/apache/spark/pull/23069] > Published Scaladoc jars missing from Maven Central > -- > > Key: SPARK-26026 > URL: https://issues.apache.org/jira/browse/SPARK-26026 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Long Cao >Assignee: Sean Owen >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > For 2.3.x and beyond, it appears that published *-javadoc.jars are missing. > For concrete examples: > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/] > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/] > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/] > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/] > After some searching, I'm venturing a guess that [this > commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033] > removed packaging Scaladoc with the rest of the distribution. > I don't think it's a huge problem since the versioned Scaladocs are hosted on > apache.org, but I use an external documentation/search tool > ([Dash|https://kapeli.com/dash]) that operates by looking up published > javadoc jars and it'd be nice to have these available. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26043) Make SparkHadoopUtil private to Spark
[ https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26043. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23066 [https://github.com/apache/spark/pull/23066] > Make SparkHadoopUtil private to Spark > - > > Key: SPARK-26043 > URL: https://issues.apache.org/jira/browse/SPARK-26043 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Assignee: Sean Owen >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > This API contains a few small helper methods used internally by Spark, mostly > related to Hadoop configs and kerberos. > It's been historically marked as "DeveloperApi". But in reality it's not very > useful for others, and changes a lot to be considered a stable API. Better to > just make it private to Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26043) Make SparkHadoopUtil private to Spark
[ https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26043: - Assignee: Sean Owen > Make SparkHadoopUtil private to Spark > - > > Key: SPARK-26043 > URL: https://issues.apache.org/jira/browse/SPARK-26043 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Assignee: Sean Owen >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > This API contains a few small helper methods used internally by Spark, mostly > related to Hadoop configs and kerberos. > It's been historically marked as "DeveloperApi". But in reality it's not very > useful for others, and changes a lot to be considered a stable API. Better to > just make it private to Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception
[ https://issues.apache.org/jira/browse/SPARK-26117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691725#comment-16691725 ] Apache Spark commented on SPARK-26117: -- User 'heary-cao' has created a pull request for this issue: https://github.com/apache/spark/pull/23084 > use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception > -- > > Key: SPARK-26117 > URL: https://issues.apache.org/jira/browse/SPARK-26117 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.5.0 >Reporter: caoxuewen >Priority: Major > > the pr #20014 which introduced SparkOutOfMemoryError to avoid killing the > entire executor when an OutOfMemoryError is thrown. > so apply for memory using MemoryConsumer. allocatePage when catch exception, > use SparkOutOfMemoryError instead of OutOfMemoryError -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26068) ChunkedByteBufferInputStream is truncated by empty chunk
[ https://issues.apache.org/jira/browse/SPARK-26068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26068. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23040 [https://github.com/apache/spark/pull/23040] > ChunkedByteBufferInputStream is truncated by empty chunk > > > Key: SPARK-26068 > URL: https://issues.apache.org/jira/browse/SPARK-26068 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Liu, Linhong >Priority: Major > Fix For: 3.0.0 > > > If ChunkedByteBuffer contains empty chunk in the middle of it, then the > ChunkedByteBufferInputStream will be truncated. All data behind the empty > chunk will not be read. > The problematic code: > {code:java} > // ChunkedByteBuffer.scala > // Assume chunks.next returns an empty chunk, then we will reach > // else branch no matter chunks.hasNext = true or not. So some data is lost. > override def read(dest: Array[Byte], offset: Int, length: Int): Int = { > if (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext) > { > currentChunk = chunks.next() > } > if (currentChunk != null && currentChunk.hasRemaining) { > val amountToGet = math.min(currentChunk.remaining(), length) > currentChunk.get(dest, offset, amountToGet) > amountToGet > } else { > close() > -1 > } > } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
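The truncation is easy to reproduce outside Spark. A self-contained Python model of the read loop quoted above shows both the bug (the current chunk is advanced at most once per read) and the natural fix (skip empty chunks in a loop):

```python
def read_all(chunks, skip_empty):
    """Drain a sequence of byte-chunks one read() at a time.

    skip_empty=False mirrors the buggy Scala read() above: the current
    chunk is advanced at most once, so an empty chunk in the middle is
    mistaken for end-of-stream and everything after it is lost.
    skip_empty=True keeps advancing past empty chunks, the fix's idea.
    """
    it = iter(chunks)
    current = next(it, None)
    out = bytearray()
    while True:
        if skip_empty:
            # fixed behaviour: loop past exhausted/empty chunks
            while current is not None and len(current) == 0:
                current = next(it, None)
        elif current is not None and len(current) == 0:
            current = next(it, None)   # buggy: advances only once
        if current:                    # chunk has remaining bytes
            out += current
            current = b""              # this read consumed the whole chunk
        else:
            return bytes(out)          # reached (or *appeared* to reach) EOF
```

With an empty chunk in the middle, the buggy variant returns only the data before it, exactly the truncation the report describes.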
[jira] [Assigned] (SPARK-26068) ChunkedByteBufferInputStream is truncated by empty chunk
[ https://issues.apache.org/jira/browse/SPARK-26068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26068: --- Assignee: Liu, Linhong > ChunkedByteBufferInputStream is truncated by empty chunk > > > Key: SPARK-26068 > URL: https://issues.apache.org/jira/browse/SPARK-26068 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Liu, Linhong >Assignee: Liu, Linhong >Priority: Major > Fix For: 3.0.0 > > > If ChunkedByteBuffer contains empty chunk in the middle of it, then the > ChunkedByteBufferInputStream will be truncated. All data behind the empty > chunk will not be read. > The problematic code: > {code:java} > // ChunkedByteBuffer.scala > // Assume chunks.next returns an empty chunk, then we will reach > // else branch no matter chunks.hasNext = true or not. So some data is lost. > override def read(dest: Array[Byte], offset: Int, length: Int): Int = { > if (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext) > { > currentChunk = chunks.next() > } > if (currentChunk != null && currentChunk.hasRemaining) { > val amountToGet = math.min(currentChunk.remaining(), length) > currentChunk.get(dest, offset, amountToGet) > amountToGet > } else { > close() > -1 > } > } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26112) Update since versions of new built-in functions.
[ https://issues.apache.org/jira/browse/SPARK-26112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26112: --- Assignee: Takuya Ueshin > Update since versions of new built-in functions. > > > Key: SPARK-26112 > URL: https://issues.apache.org/jira/browse/SPARK-26112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.0.0 > > > The following 5 functions were removed from branch-2.4: > - map_entries > - map_filter > - transform_values > - transform_keys > - map_zip_with > We should update the since version to 3.0.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26024) Dataset API: repartitionByRange(...) has inconsistent behaviour
[ https://issues.apache.org/jira/browse/SPARK-26024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26024. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23025 [https://github.com/apache/spark/pull/23025] > Dataset API: repartitionByRange(...) has inconsistent behaviour > --- > > Key: SPARK-26024 > URL: https://issues.apache.org/jira/browse/SPARK-26024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2 > Environment: Spark version 2.3.2 >Reporter: Julien Peloton >Priority: Major > Labels: dataFrame, partitioning, repartition, spark-sql > Fix For: 3.0.0 > > > Hi, > I recently played with the {{repartitionByRange}} method for DataFrame > introduced in SPARK-22614. For DataFrames larger than the one tested in the > code (which has only 10 elements), the code sends back random results. > As a test showing the inconsistent behaviour, I start from the unit test code > used to test {{repartitionByRange}} > ([here|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala#L352]) > but I increase the size of the initial array to 1000, repartition using 3 > partitions, and count the number of elements per partition: > > {code} > // Shuffle numbers from 0 to 1000, and make a DataFrame > val df = Random.shuffle(0.to(1000)).toDF("val") > // Repartition it using 3 partitions > // Sum up number of elements in each partition, and collect it. > // And do it several times > for (i <- 0 to 9) { > var counts = df.repartitionByRange(3, col("val")) > .mapPartitions{part => Iterator(part.size)} > .collect() > println(counts.toList) > } > // -> the number of elements in each partition varies... > {code} > I do not know whether it is expected (I will dig further in the code), but it > sounds like a bug. > Or did I just misinterpret what {{repartitionByRange}} is for? > Any ideas? > Thanks! 
> Julien -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
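The varying counts are consistent with how range partitioning derives its boundaries: from a random sample of the column rather than from the full data. A hypothetical sketch (plain Python, not Spark's actual RangePartitioner) shows per-partition sizes that fluctuate between runs while no rows are lost:

```python
import bisect
import random

# Sketch of sample-based range partitioning: boundaries come from
# quantiles of a random SAMPLE, so each run draws slightly different
# boundaries. Every row still lands in exactly one partition; only the
# per-partition counts vary, which is what the reporter observed.

def range_partition(values, num_partitions, sample_size=100):
    sample = sorted(random.sample(values, sample_size))
    # num_partitions-1 boundaries at evenly spaced sample quantiles
    bounds = [sample[(i + 1) * sample_size // num_partitions]
              for i in range(num_partitions - 1)]
    parts = [[] for _ in range(num_partitions)]
    for v in values:
        parts[bisect.bisect_right(bounds, v)].append(v)
    return parts

vals = list(range(1001))
parts = range_partition(vals, 3)
# len(parts[i]) differs from run to run, but the union is always complete
# and the partitions are disjoint, ordered ranges
```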
[jira] [Assigned] (SPARK-26024) Dataset API: repartitionByRange(...) has inconsistent behaviour
[ https://issues.apache.org/jira/browse/SPARK-26024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26024: --- Assignee: Julien Peloton > Dataset API: repartitionByRange(...) has inconsistent behaviour > --- > > Key: SPARK-26024 > URL: https://issues.apache.org/jira/browse/SPARK-26024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2 > Environment: Spark version 2.3.2 >Reporter: Julien Peloton >Assignee: Julien Peloton >Priority: Major > Labels: dataFrame, partitioning, repartition, spark-sql > Fix For: 3.0.0 > > > Hi, > I recently played with the {{repartitionByRange}} method for DataFrame > introduced in SPARK-22614. For DataFrames larger than the one tested in the > code (which has only 10 elements), the code sends back random results. > As a test for showing the inconsistent behaviour, I start as the unit code > used to test {{repartitionByRange}} > ([here|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala#L352]) > but I increase the size of the initial array to 1000, repartition using 3 > partitions, and count the number of element per-partitions: > > {code} > // Shuffle numbers from 0 to 1000, and make a DataFrame > val df = Random.shuffle(0.to(1000)).toDF("val") > // Repartition it using 3 partitions > // Sum up number of elements in each partition, and collect it. > // And do it several times > for (i <- 0 to 9) { > var counts = df.repartitionByRange(3, col("val")) > .mapPartitions{part => Iterator(part.size)} > .collect() > println(counts.toList) > } > // -> the number of elements in each partition varies... > {code} > I do not know whether it is expected (I will dig further in the code), but it > sounds like a bug. > Or I just misinterpret what {{repartitionByRange}} is for? > Any ideas? > Thanks! 
> Julien -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26112) Update since versions of new built-in functions.
[ https://issues.apache.org/jira/browse/SPARK-26112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26112. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23082 [https://github.com/apache/spark/pull/23082] > Update since versions of new built-in functions. > > > Key: SPARK-26112 > URL: https://issues.apache.org/jira/browse/SPARK-26112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.0.0 > > > The following 5 functions were removed from branch-2.4: > - map_entries > - map_filter > - transform_values > - transform_keys > - map_zip_with > We should update the since version to 3.0.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version
[ https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691787#comment-16691787 ] Andrew Otto commented on SPARK-14492: - I found my issue: we were loading some Hive 1.1.0 jars manually using spark.driver.extraClassPath in order to initiate a Hive JDBC connection directly to Hive, instead of using spark.sql(), to work around https://issues.apache.org/jira/browse/SPARK-23890. The Hive 1.1.0 classes were loaded before the ones included with Spark, and as such they failed when referencing a Hive configuration field that didn't exist in 1.1.0. > Spark SQL 1.6.0 does not work with external Hive metastore version lower than > 1.2.0; its not backwards compatible with earlier version > -- > > Key: SPARK-14492 > URL: https://issues.apache.org/jira/browse/SPARK-14492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunil Rangwani >Priority: Critical > > Spark SQL when configured with a Hive version lower than 1.2.0 throws a > java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME > because this field was introduced in Hive 1.2.0 so its not possible to use > Hive metastore version lower than 1.2.0 with Spark. 
The details of the Hive > changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 > {code:java} > Exception in thread "main" java.lang.NoSuchFieldError: > METASTORE_CLIENT_SOCKET_LIFETIME > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:271) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at > 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
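The root cause in the comment above is ordinary classpath precedence: the first entry that provides a class wins. A toy resolver (illustrative only; real JVM class loading is parent-first across a classloader hierarchy and more involved) makes the shadowing concrete:

```python
# Toy model of classpath precedence: like a classloader scanning its
# classpath, return the FIRST entry that provides the requested class.
# Jar and class names below are illustrative, echoing the comment: jars
# added via spark.driver.extraClassPath are searched before the Hive 1.2
# classes bundled with Spark, so the old HiveConf (without the
# METASTORE_CLIENT_SOCKET_LIFETIME field) is the one that gets loaded.

def resolve(classpath, class_name):
    """Return the first classpath entry providing class_name."""
    for entry, provided in classpath:
        if class_name in provided:
            return entry
    raise LookupError(class_name)

classpath = [
    ("hive-exec-1.1.0.jar", {"HiveConf"}),            # extraClassPath, first
    ("spark-hive.jar", {"HiveConf", "HiveContext"}),  # shipped with Spark
]
```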
[jira] [Resolved] (SPARK-26071) disallow map as map key
[ https://issues.apache.org/jira/browse/SPARK-26071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26071. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23045 [https://github.com/apache/spark/pull/23045] > disallow map as map key > --- > > Key: SPARK-26071 > URL: https://issues.apache.org/jira/browse/SPARK-26071 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25528) data source V2 API refactoring (batch read)
[ https://issues.apache.org/jira/browse/SPARK-25528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691799#comment-16691799 ] Apache Spark commented on SPARK-25528: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/23086 > data source V2 API refactoring (batch read) > --- > > Key: SPARK-25528 > URL: https://issues.apache.org/jira/browse/SPARK-25528 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > refactor the read side API according to this abstraction > {code} > batch: catalog -> table -> scan > streaming: catalog -> table -> stream -> scan > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25528) data source V2 API refactoring (batch read)
[ https://issues.apache.org/jira/browse/SPARK-25528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-25528: Summary: data source V2 API refactoring (batch read) (was: data source V2 read side API refactoring) > data source V2 API refactoring (batch read) > --- > > Key: SPARK-25528 > URL: https://issues.apache.org/jira/browse/SPARK-25528 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > refactor the read side API according to this abstraction > {code} > batch: catalog -> table -> scan > streaming: catalog -> table -> stream -> scan > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
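The layering in the abstraction quoted above can be sketched as a chain of small interfaces. This is a hedged illustration only — the names `Catalog`, `Table`, `Stream`, and `Scan` mirror the comment's diagram, not Spark's actual DataSourceV2 classes:

```python
# Hedged sketch of the read-side layering from the comment above; class
# names mirror the diagram, not Spark's real DataSourceV2 API.
class Scan:
    """One configured read of a table."""

class Stream:
    """Streaming adds one stage: the stream yields scans per trigger."""
    def new_scan(self) -> Scan:
        return Scan()

class Table:
    def new_scan(self) -> Scan:      # batch: catalog -> table -> scan
        return Scan()

    def new_stream(self) -> Stream:  # streaming: catalog -> table -> stream -> scan
        return Stream()

class Catalog:
    def load_table(self, name: str) -> Table:
        return Table()

batch_scan = Catalog().load_table("t").new_scan()
stream_scan = Catalog().load_table("t").new_stream().new_scan()
```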
[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version
[ https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691811#comment-16691811 ] Sunil Rangwani commented on SPARK-14492: [~srowen] Please reopen this. It is not about building with varying versions of Hive. After building with the required version of Hive, it gives a runtime error - as explained in my earlier comment. > Spark SQL 1.6.0 does not work with external Hive metastore version lower than > 1.2.0; its not backwards compatible with earlier version > -- > > Key: SPARK-14492 > URL: https://issues.apache.org/jira/browse/SPARK-14492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunil Rangwani >Priority: Critical > > Spark SQL when configured with a Hive version lower than 1.2.0 throws a > java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME > because this field was introduced in Hive 1.2.0 so its not possible to use > Hive metastore version lower than 1.2.0 with Spark. 
The details of the Hive > changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 > {code:java} > Exception in thread "main" java.lang.NoSuchFieldError: > METASTORE_CLIENT_SOCKET_LIFETIME > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:271) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at > 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
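For context on the error class itself: `NoSuchFieldError` is a JVM linkage error — code built against Hive 1.2.0 references a constant that a pre-1.2.0 jar on the classpath never defined. A hedged Python analogue (illustration only, not Spark or Hive code):

```python
# Python analogue of the NoSuchFieldError above (illustration only):
# code written against a newer API looks up a name that an older
# module version never defined.
class OldHiveConf:
    """Stands in for a pre-1.2.0 HiveConf: the newer constant is absent."""
    METASTORE_URIS = "hive.metastore.uris"
    # METASTORE_CLIENT_SOCKET_LIFETIME was only introduced in Hive 1.2.0

try:
    OldHiveConf.METASTORE_CLIENT_SOCKET_LIFETIME
except AttributeError as exc:
    failure = str(exc)
```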
[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version
[ https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691828#comment-16691828 ] Sunil Rangwani commented on SPARK-14492: [~ottomata] you are using a CDH version I believe. I don't know what dependencies their parcels include. This bug was originally found in the open source version built using maven as described here [https://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support] > Spark SQL 1.6.0 does not work with external Hive metastore version lower than > 1.2.0; its not backwards compatible with earlier version > -- > > Key: SPARK-14492 > URL: https://issues.apache.org/jira/browse/SPARK-14492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunil Rangwani >Priority: Critical > > Spark SQL when configured with a Hive version lower than 1.2.0 throws a > java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME > because this field was introduced in Hive 1.2.0 so its not possible to use > Hive metastore version lower than 1.2.0 with Spark. 
The details of the Hive > changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 > (The full NoSuchFieldError stack trace is quoted verbatim in the earlier comment above.) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26090) Resolve most miscellaneous deprecation and build warnings for Spark 3
[ https://issues.apache.org/jira/browse/SPARK-26090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26090. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23065 [https://github.com/apache/spark/pull/23065] > Resolve most miscellaneous deprecation and build warnings for Spark 3 > - > > Key: SPARK-26090 > URL: https://issues.apache.org/jira/browse/SPARK-26090 > Project: Spark > Issue Type: Improvement > Components: ML, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > The build has a lot of deprecation warnings. Some are new in Scala 2.12 and > Java 11. We've fixed some, but I wanted to take a pass at fixing lots of easy > miscellaneous ones here. > They're too numerous and small to list here; see the pull request. Some > highlights: > - @BeanInfo is deprecated in 2.12, and BeanInfo classes are pretty ancient in > Java. Instead, case classes can explicitly declare getters > - Eta expansion of zero-arg methods; foo() becomes () => foo() in many cases > - Floating-point Range is inexact and deprecated, like 0.0 to 100.0 by 1.0 > - finalize() is finally deprecated (just needs to be suppressed) > - StageInfo.attempId was deprecated and easiest to remove here > I'm not now going to touch some chunks of deprecation warnings: > - Parquet deprecations > - Hive deprecations (particularly serde2 classes) > - Deprecations in generated code (mostly Thriftserver CLI) > - ProcessingTime deprecations (we may need to revive this class as internal) > - many MLlib deprecations because they concern methods that may be removed > anyway > - a few Kinesis deprecations I couldn't figure out > - Mesos get/setRole, which I don't know well > - Kafka/ZK deprecations (e.g. 
poll()) > - Kinesis > - a few other ones that will probably resolve by deleting a deprecated method -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23886) update query.status
[ https://issues.apache.org/jira/browse/SPARK-23886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691881#comment-16691881 ] Gabor Somogyi commented on SPARK-23886: --- Thanks [~efim.poberezkin]! May I ask you then to close your PRs for this Jira and for SPARK-24063? > update query.status > --- > > Key: SPARK-23886 > URL: https://issues.apache.org/jira/browse/SPARK-23886 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26026) Published Scaladoc jars missing from Maven Central
[ https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16691965#comment-16691965 ] Long Cao commented on SPARK-26026: -- [~srowen] thanks for the expedient resolution! It's a small thing to me but nice to see. > Published Scaladoc jars missing from Maven Central > -- > > Key: SPARK-26026 > URL: https://issues.apache.org/jira/browse/SPARK-26026 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Long Cao >Assignee: Sean Owen >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > For 2.3.x and beyond, it appears that published *-javadoc.jars are missing. > For concrete examples: > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > After some searching, I'm venturing a guess that [this > commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033] > removed packaging Scaladoc with the rest of the distribution. > I don't think it's a huge problem since the versioned Scaladocs are hosted on > apache.org, but I use an external documentation/search tool > ([Dash|https://kapeli.com/dash]) that operates by looking up published > javadoc jars and it'd be nice to have these available. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark
Attila Zsolt Piros created SPARK-26118: -- Summary: Make Jetty's requestHeaderSize configurable in Spark Key: SPARK-26118 URL: https://issues.apache.org/jira/browse/SPARK-26118 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.4.0, 2.3.2, 2.2.2, 2.1.3, 3.0.0 Reporter: Attila Zsolt Piros For long authorization fields the request header size could be over the default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request Entity Too Large). This issue may occur if the user is a member of many Active Directory user groups. The HTTP request to the server contains the Kerberos token in the WWW-Authenticate header. The header size increases together with the number of user groups. Currently there is no way in Spark to override this limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
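The arithmetic behind the failure is simple: the authentication header carries the whole Kerberos token, so its size tracks Active Directory group membership. A rough, self-contained sketch (the 12000-byte token is a made-up figure for a user in many groups; 8192 bytes is Jetty's documented default `requestHeaderSize`):

```python
# Rough illustration of why large Kerberos tokens trip Jetty's limit.
# The token length below is a made-up figure, not measured data.
JETTY_DEFAULT_REQUEST_HEADER_SIZE = 8192  # bytes, Jetty's default

token = "A" * 12000  # hypothetical base64 Kerberos/SPNEGO token
header = f"WWW-Authenticate: Negotiate {token}\r\n"

exceeds_limit = len(header.encode("ascii")) > JETTY_DEFAULT_REQUEST_HEADER_SIZE
# When exceeds_limit is True, Jetty answers HTTP 413 before the request
# reaches Spark, hence the need to make the limit configurable.
```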
[jira] [Updated] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark
[ https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-26118: --- Affects Version/s: (was: 2.3.2) (was: 2.4.0) (was: 2.2.2) (was: 2.1.3) > Make Jetty's requestHeaderSize configurable in Spark > > > Key: SPARK-26118 > URL: https://issues.apache.org/jira/browse/SPARK-26118 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > For long authorization fields the request header size could be over the > default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request > Entity Too Large). > This issue may occur if the user is a member of many Active Directory user > groups. > The HTTP request to the server contains the Kerberos token in the > WWW-Authenticate header. The header size increases together with the number > of user groups. > Currently there is no way in Spark to override this limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark
[ https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692018#comment-16692018 ] Attila Zsolt Piros commented on SPARK-26118: I am working on a PR. > Make Jetty's requestHeaderSize configurable in Spark > > > Key: SPARK-26118 > URL: https://issues.apache.org/jira/browse/SPARK-26118 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > For long authorization fields the request header size could be over the > default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request > Entity Too Large). > This issue may occur if the user is a member of many Active Directory user > groups. > The HTTP request to the server contains the Kerberos token in the > WWW-Authenticate header. The header size increases together with the number > of user groups. > Currently there is no way in Spark to override this limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26119) Task metrics summary in the stage page should contain only successful tasks metrics
ABHISHEK KUMAR GUPTA created SPARK-26119: Summary: Task metrics summary in the stage page should contain only successful tasks metrics Key: SPARK-26119 URL: https://issues.apache.org/jira/browse/SPARK-26119 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 2.3.2 Reporter: ABHISHEK KUMAR GUPTA Currently the task metrics summary table in the stage tab shows a summary over all tasks, but it should display the summary of only succeeded tasks in the task summary metrics table. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
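The proposed behavior amounts to filtering by task status before aggregating. A minimal sketch (the task records are fake in-memory dicts, not Spark's UI data model):

```python
# Sketch of the proposed fix: aggregate metrics over succeeded tasks only.
# Task records here are fake in-memory dicts, not Spark's UI data model.
from statistics import median

tasks = [
    {"status": "SUCCESS", "duration_ms": 100},
    {"status": "SUCCESS", "duration_ms": 300},
    {"status": "FAILED",  "duration_ms": 5},  # must not skew the summary
]

succeeded = [t["duration_ms"] for t in tasks if t["status"] == "SUCCESS"]
summary = {"count": len(succeeded), "median_ms": median(succeeded)}
```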
[jira] [Commented] (SPARK-26119) Task metrics summary in the stage page should contain only successful tasks metrics
[ https://issues.apache.org/jira/browse/SPARK-26119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692093#comment-16692093 ] shahid commented on SPARK-26119: Thanks. I am working on it > Task metrics summary in the stage page should contain only successful tasks > metrics > --- > > Key: SPARK-26119 > URL: https://issues.apache.org/jira/browse/SPARK-26119 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > Currently task metrics summary table in the stage tab shows summary > corresponds to all the tasks. But, we should display the summary of only > succeeded tasks in the tasks summary metrics table -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests
[ https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-26120: - Component/s: Structured Streaming SparkR > Fix a streaming query leak in Structured Streaming R tests > -- > > Key: SPARK-26120 > URL: https://issues.apache.org/jira/browse/SPARK-26120 > Project: Spark > Issue Type: Test > Components: SparkR, Structured Streaming, Tests >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > > "Specify a schema by using a DDL-formatted string when reading" doesn't stop > the streaming query before stopping Spark. It causes the following annoying > logs. > {code} > Exception in thread "stream execution thread for [id = > 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = > 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution > thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = > 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: > Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) > at > org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) > Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. > at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) > at > org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) > at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) > ... 7 more > org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) > at > org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) > Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. 
> at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) > at > org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) > at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) > ... 7 more > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
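The fix pattern is the usual one: stop the query in finally-style cleanup before stopping Spark. A sketch in Python (`FakeQuery` is an invented stand-in; the actual fix is in the SparkR test suite):

```python
# Sketch of the leak fix: always stop the streaming query before teardown.
# FakeQuery is an invented stand-in for a streaming query handle.
class FakeQuery:
    def __init__(self):
        self.active = True

    def stop(self):
        self.active = False

query = FakeQuery()
try:
    pass  # the test's assertions against the stream would run here
finally:
    query.stop()  # without this, the query thread outlives the session
```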
[jira] [Commented] (SPARK-26119) Task metrics summary in the stage page should contain only successful tasks metrics
[ https://issues.apache.org/jira/browse/SPARK-26119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692175#comment-16692175 ] Apache Spark commented on SPARK-26119: -- User 'shahidki31' has created a pull request for this issue: https://github.com/apache/spark/pull/23088 > Task metrics summary in the stage page should contain only successful tasks > metrics > --- > > Key: SPARK-26119 > URL: https://issues.apache.org/jira/browse/SPARK-26119 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > Currently task metrics summary table in the stage tab shows summary > corresponds to all the tasks. But, we should display the summary of only > succeeded tasks in the tasks summary metrics table -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests
[ https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-26120: - Priority: Minor (was: Major) > Fix a streaming query leak in Structured Streaming R tests > -- > > Key: SPARK-26120 > URL: https://issues.apache.org/jira/browse/SPARK-26120 > Project: Spark > Issue Type: Test > Components: SparkR, Structured Streaming, Tests >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > > "Specify a schema by using a DDL-formatted string when reading" doesn't stop > the streaming query before stopping Spark. It causes the annoying logs > quoted verbatim in the issue description above. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests
[ https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26120: Assignee: Apache Spark (was: Shixiong Zhu) > Fix a streaming query leak in Structured Streaming R tests > -- > > Key: SPARK-26120 > URL: https://issues.apache.org/jira/browse/SPARK-26120 > Project: Spark > Issue Type: Test > Components: SparkR, Structured Streaming, Tests >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark >Priority: Minor > > "Specify a schema by using a DDL-formatted string when reading" doesn't stop > the streaming query before stopping Spark. It causes the annoying logs > quoted verbatim in the issue description above. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26119) Task metrics summary in the stage page should contain only successful tasks metrics
[ https://issues.apache.org/jira/browse/SPARK-26119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26119: Assignee: (was: Apache Spark) > Task metrics summary in the stage page should contain only successful tasks > metrics > --- > > Key: SPARK-26119 > URL: https://issues.apache.org/jira/browse/SPARK-26119 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > Currently, the task metrics summary table in the stage tab shows a summary > computed over all tasks. Instead, we should display the summary of only > succeeded tasks in the task summary metrics table. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
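A minimal sketch of the requested behavior, assuming hypothetical task records; Spark's actual fix lives in its Scala Web UI / status store code, so the field names and record shape here are illustrative only:

```python
# Sketch: compute stage-level task metric summaries over successful tasks only.
# Task records and field names are hypothetical stand-ins for Spark's UI data.
import statistics

tasks = [
    {"taskId": 0, "status": "SUCCESS", "duration_ms": 120},
    {"taskId": 1, "status": "SUCCESS", "duration_ms": 95},
    {"taskId": 2, "status": "FAILED",  "duration_ms": 5000},  # would skew the summary
    {"taskId": 3, "status": "SUCCESS", "duration_ms": 110},
]

def summarize(tasks, metric="duration_ms", only_successful=True):
    # Filtering out non-successful tasks is the behavior the issue asks for.
    if only_successful:
        tasks = [t for t in tasks if t["status"] == "SUCCESS"]
    values = sorted(t[metric] for t in tasks)
    return {"min": values[0], "median": statistics.median(values), "max": values[-1]}

print(summarize(tasks))                         # failed task excluded
print(summarize(tasks, only_successful=False))  # failed task inflates max
```

With the filter, the long-running failed task no longer distorts the max and median shown in the stage page.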
[jira] [Commented] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests
[ https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692176#comment-16692176 ] Apache Spark commented on SPARK-26120: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/23089 > Fix a streaming query leak in Structured Streaming R tests > -- > > Key: SPARK-26120 > URL: https://issues.apache.org/jira/browse/SPARK-26120 > Project: Spark > Issue Type: Test > Components: SparkR, Structured Streaming, Tests >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > > "Specify a schema by using a DDL-formatted string when reading" doesn't stop > the streaming query before stopping Spark. It causes the following annoying > logs. > {code} > Exception in thread "stream execution thread for [id = > 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = > 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution > thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = > 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: > Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) > at > org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) > Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. > at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) > at > org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) > at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) > ... 7 more > org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) > at > org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) > Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. 
> at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) > at > org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) > at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) > ... 7 more > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.o
[jira] [Assigned] (SPARK-26119) Task metrics summary in the stage page should contain only successful tasks metrics
[ https://issues.apache.org/jira/browse/SPARK-26119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26119: Assignee: Apache Spark > Task metrics summary in the stage page should contain only successful tasks > metrics > --- > > Key: SPARK-26119 > URL: https://issues.apache.org/jira/browse/SPARK-26119 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: Apache Spark >Priority: Major > > Currently, the task metrics summary table in the stage tab shows a summary > computed over all tasks. Instead, we should display the summary of only > succeeded tasks in the task summary metrics table. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests
[ https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692178#comment-16692178 ] Apache Spark commented on SPARK-26120: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/23089 > Fix a streaming query leak in Structured Streaming R tests > -- > > Key: SPARK-26120 > URL: https://issues.apache.org/jira/browse/SPARK-26120 > Project: Spark > Issue Type: Test > Components: SparkR, Structured Streaming, Tests >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > > "Specify a schema by using a DDL-formatted string when reading" doesn't stop > the streaming query before stopping Spark. It causes the following annoying > logs. > {code} > Exception in thread "stream execution thread for [id = > 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = > 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution > thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = > 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: > Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) > at > org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) > Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. > at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) > at > org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) > at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) > ... 7 more > org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) > at > org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) > Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. 
> at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) > at > org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) > at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) > ... 7 more > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.o
[jira] [Created] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests
Shixiong Zhu created SPARK-26120: Summary: Fix a streaming query leak in Structured Streaming R tests Key: SPARK-26120 URL: https://issues.apache.org/jira/browse/SPARK-26120 Project: Spark Issue Type: Test Components: Tests Affects Versions: 2.4.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu "Specify a schema by using a DDL-formatted string when reading" doesn't stop the streaming query before stopping Spark. It causes the following annoying logs. {code} Exception in thread "stream execution thread for [id = 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) at org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) at org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped. 
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) at org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) ... 7 more org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) at org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) at org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped. at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) at org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) ... 
7 more {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
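The leak above comes from stopping Spark while a streaming query is still running, so StreamExecution's shutdown path calls into an already-stopped RpcEnv. The fix pattern is to stop the query before stopping the session. The failing test itself is in SparkR; the sketch below models the same ordering in Python with stand-in classes (not pyspark's real StreamingQuery/SparkSession), so the lifecycle is testable in isolation:

```python
# Sketch of the fix pattern: stop streaming queries before stopping the session.
# StreamingQuery/SparkSessionStub are stand-ins for pyspark's StreamingQuery
# and SparkSession, used only to make the shutdown ordering observable.

class StreamingQuery:
    def __init__(self, log):
        self.log = log
    def stop(self):
        self.log.append("query stopped")

class SparkSessionStub:
    def __init__(self):
        self.log = []
        self.streams = []
    def start_query(self):
        q = StreamingQuery(self.log)
        self.streams.append(q)
        return q
    def stop(self):
        self.log.append("session stopped")

def run_test(session):
    query = session.start_query()
    try:
        pass  # ... assertions against the query's output would go here ...
    finally:
        query.stop()  # stop the query first; this is what the R test was missing
    session.stop()    # only then stop the session

spark = SparkSessionStub()
run_test(spark)
print(spark.log)
```

With this ordering, the query's termination notification runs while the RPC environment is still alive, so the RpcEnvStoppedException noise disappears.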
[jira] [Assigned] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests
[ https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26120: Assignee: Shixiong Zhu (was: Apache Spark) > Fix a streaming query leak in Structured Streaming R tests > -- > > Key: SPARK-26120 > URL: https://issues.apache.org/jira/browse/SPARK-26120 > Project: Spark > Issue Type: Test > Components: SparkR, Structured Streaming, Tests >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > > "Specify a schema by using a DDL-formatted string when reading" doesn't stop > the streaming query before stopping Spark. It causes the following annoying > logs. > {code} > Exception in thread "stream execution thread for [id = > 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = > 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution > thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = > 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: > Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) > at > org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) > Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. > at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) > at > org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) > at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) > ... 7 more > org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) > at > org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) > Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. 
> at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) > at > org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) > at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) > ... 7 more > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26121) [Structured Streaming] Allow users to define prefix of Kafka's consumer group (group.id)
Anastasios Zouzias created SPARK-26121: -- Summary: [Structured Streaming] Allow users to define prefix of Kafka's consumer group (group.id) Key: SPARK-26121 URL: https://issues.apache.org/jira/browse/SPARK-26121 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Anastasios Zouzias I ran into the following situation with Spark Structured Streaming (SS) using Kafka. In a project I work on, there is already a secured Kafka setup where ops can issue an SSL certificate per group.id, which must be predefined (or at least have a predefined prefix). On the other hand, Spark SS fixes the group.id to val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}" see, e.g., [https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L124] https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L81 I guess the Spark developers had a good reason to fix it, but is it possible to make the prefix of the above uniqueGroupId ("spark-kafka-source") configurable? The rationale is that Spark users would then not be forced to use the same certificate for group ids of the form spark-kafka-source-*. DoD: * Allow Spark SS users to define the group.id prefix as an input parameter. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
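A sketch of the requested behavior: derive the consumer group id from a user-supplied prefix, falling back to the current hard-coded one. The option name "groupIdPrefix" is an assumption here (the real change would live in KafkaSourceProvider.scala, in Scala), and the hash is a stand-in for metadataPath.hashCode:

```python
# Sketch: configurable group.id prefix instead of the fixed "spark-kafka-source".
# The "groupIdPrefix" option name and the hash function are illustrative only.
import uuid

def unique_group_id(options, metadata_path):
    prefix = options.get("groupIdPrefix", "spark-kafka-source")
    return "{}-{}-{}".format(prefix, uuid.uuid4(), hash(metadata_path))

default_id = unique_group_id({}, "/tmp/checkpoint")
custom_id = unique_group_id({"groupIdPrefix": "team-a"}, "/tmp/checkpoint")
print(default_id)  # spark-kafka-source-<uuid>-<hash>
print(custom_id)   # team-a-<uuid>-<hash>
```

The suffix stays unique per query, so only the certificate-relevant prefix becomes user-controlled.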
[jira] [Reopened] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Dautkhanov reopened SPARK-26019: --- Reproduced myself > pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" > in authenticate_and_accum_updates() > > > Key: SPARK-26019 > URL: https://issues.apache.org/jira/browse/SPARK-26019 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2, 2.4.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > Started happening after 2.3.1 -> 2.3.2 upgrade. > > {code:python} > Exception happened during processing of request from ('127.0.0.1', 43418) > > Traceback (most recent call last): > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 290, in _handle_request_noblock > self.process_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 318, in process_request > self.finish_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 331, in finish_request > self.RequestHandlerClass(request, client_address, self) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 652, in __init__ > self.handle() > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 263, in handle > poll(authenticate_and_accum_updates) > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 238, in poll > if func(): > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 251, in authenticate_and_accum_updates > received_token = self.rfile.read(len(auth_token)) > TypeError: object of type 'NoneType' has no len() > > {code} > > Error happens here: > 
https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254 > The PySpark code was just running a simple pipeline of > binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. ) > and then converting it to a dataframe and running a count on it. > It seems error is flaky - on next rerun it didn't happen. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692219#comment-16692219 ] Ruslan Dautkhanov commented on SPARK-26019: --- [~hyukjin.kwon] today I reproduced this for the first time, but we still receive reports from other users of ours as well. Here's the code, on Spark 2.3.2 + Python 2.7.15. Execute it on a freshly created Spark session: {code:python} def python_major_version(): import sys return sys.version_info[0] print(python_major_version()) print(sc.parallelize([1]).map(lambda x: python_major_version()).collect())  # error happens here! {code} It always reproduces for me. Notice that just rerunning the same code makes this error disappear. > pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" > in authenticate_and_accum_updates() > > > Key: SPARK-26019 > URL: https://issues.apache.org/jira/browse/SPARK-26019 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2, 2.4.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > Started happening after 2.3.1 -> 2.3.2 upgrade. 
> > {code:python} > Exception happened during processing of request from ('127.0.0.1', 43418) > > Traceback (most recent call last): > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 290, in _handle_request_noblock > self.process_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 318, in process_request > self.finish_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 331, in finish_request > self.RequestHandlerClass(request, client_address, self) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 652, in __init__ > self.handle() > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 263, in handle > poll(authenticate_and_accum_updates) > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 238, in poll > if func(): > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 251, in authenticate_and_accum_updates > received_token = self.rfile.read(len(auth_token)) > TypeError: object of type 'NoneType' has no len() > > {code} > > Error happens here: > https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254 > The PySpark code was just running a simple pipeline of > binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. ) > and then converting it to a dataframe and running a count on it. > It seems error is flaky - on next rerun it didn't happen. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692226#comment-16692226 ] Ruslan Dautkhanov commented on SPARK-26019: --- Might be broken by https://github.com/apache/spark/pull/22635 change > pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" > in authenticate_and_accum_updates() > > > Key: SPARK-26019 > URL: https://issues.apache.org/jira/browse/SPARK-26019 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2, 2.4.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > Started happening after 2.3.1 -> 2.3.2 upgrade. > > {code:python} > Exception happened during processing of request from ('127.0.0.1', 43418) > > Traceback (most recent call last): > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 290, in _handle_request_noblock > self.process_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 318, in process_request > self.finish_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 331, in finish_request > self.RequestHandlerClass(request, client_address, self) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 652, in __init__ > self.handle() > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 263, in handle > poll(authenticate_and_accum_updates) > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 238, in poll > if func(): > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 251, in authenticate_and_accum_updates > received_token = self.rfile.read(len(auth_token)) > TypeError: 
object of type 'NoneType' has no len() > > {code} > > Error happens here: > https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254 > The PySpark code was just running a simple pipeline of > binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. ) > and then converting it to a dataframe and running a count on it. > It seems the error is flaky - on the next rerun it didn't happen. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692233#comment-16692233 ] Ruslan Dautkhanov commented on SPARK-26019: --- Sorry, nope it was broken by this change - https://github.com/apache/spark/commit/15fc2372269159ea2556b028d4eb8860c4108650#diff-c3339bbf2b850b79445b41e9eecf57c4R249 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692237#comment-16692237 ] Ruslan Dautkhanov commented on SPARK-26019: --- cc [~lucacanali] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark
[ https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26118: Assignee: (was: Apache Spark) > Make Jetty's requestHeaderSize configurable in Spark > > > Key: SPARK-26118 > URL: https://issues.apache.org/jira/browse/SPARK-26118 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > For long authorization fields, the request header size can exceed the > default limit (8192 bytes), in which case Jetty replies with HTTP 413 (Request > Entity Too Large). > This issue may occur if the user is a member of many Active Directory user > groups. > The HTTP request to the server contains the Kerberos token in the > WWW-Authenticate header. The header size grows with the number > of user groups. > Currently there is no way in Spark to override this limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
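Once the proposed setting exists, raising the limit would presumably look like the sketch below. This is a hedged example: the property name spark.ui.requestHeaderSize and the size-with-unit value format are assumptions taken from the pull request under review, not a settled API.

```python
from pyspark.sql import SparkSession

# Hedged sketch: assumes the configuration key proposed in the linked pull
# request (spark.ui.requestHeaderSize) and a size-with-unit value format.
# Raising the limit above Jetty's 8192-byte default leaves room for large
# Kerberos/SPNEGO tokens in the WWW-Authenticate header.
spark = (SparkSession.builder
         .appName("large-auth-headers")
         .config("spark.ui.requestHeaderSize", "16k")
         .getOrCreate())
```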
[jira] [Commented] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark
[ https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692246#comment-16692246 ] Apache Spark commented on SPARK-26118: -- User 'attilapiros' has created a pull request for this issue: https://github.com/apache/spark/pull/23090 > Make Jetty's requestHeaderSize configurable in Spark > > > Key: SPARK-26118 > URL: https://issues.apache.org/jira/browse/SPARK-26118 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > For long authorization fields the request header size could be over the > default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request > Entity Too Large). > This issue may occur if the user is a member of many Active Directory user > groups. > The HTTP request to the server contains the Kerberos token in the > WWW-Authenticate header. The header size increases together with the number > of user groups. > Currently there is no way in Spark to override this limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark
[ https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26118: Assignee: Apache Spark > Make Jetty's requestHeaderSize configurable in Spark > > > Key: SPARK-26118 > URL: https://issues.apache.org/jira/browse/SPARK-26118 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Apache Spark >Priority: Major > > For long authorization fields the request header size could be over the > default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request > Entity Too Large). > This issue may occur if the user is a member of many Active Directory user > groups. > The HTTP request to the server contains the Kerberos token in the > WWW-Authenticate header. The header size increases together with the number > of user groups. > Currently there is no way in Spark to override this limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26122) Support encoding for multiLine in CSV datasource
Maxim Gekk created SPARK-26122: -- Summary: Support encoding for multiLine in CSV datasource Key: SPARK-26122 URL: https://issues.apache.org/jira/browse/SPARK-26122 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk Currently, the CSV datasource is not able to read CSV files in a different encoding when multiLine is enabled. This ticket aims to support the encoding option in multiLine mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
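The combination the ticket asks for can be illustrated in plain Python (this is a toy, not Spark's CSV reader; the file path and contents are illustrative): a non-UTF-8 file whose quoted fields contain embedded newlines, which only parses correctly when decoded with the right charset.

```python
import csv
import os
import tempfile

# Write a UTF-16 CSV file with a "multiLine" record: the quoted comment
# field spans two physical lines.
path = os.path.join(tempfile.mkdtemp(), "multiline_utf16.csv")
with open(path, "w", encoding="utf-16", newline="") as f:
    f.write('id,comment\r\n1,"first line\nsecond line"\r\n')

# Decoding with the correct charset lets the csv parser keep the embedded
# newline inside the quoted field instead of splitting the record.
with open(path, encoding="utf-16", newline="") as f:
    rows = list(csv.reader(f))

print(rows)  # [['id', 'comment'], ['1', 'first line\nsecond line']]
```

In Spark, the corresponding call would presumably be spark.read.option("encoding", "UTF-16").option("multiLine", True).csv(path); whether the encoding option is honoured when multiLine is enabled is exactly what this ticket tracks.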
[jira] [Commented] (SPARK-26122) Support encoding for multiLine in CSV datasource
[ https://issues.apache.org/jira/browse/SPARK-26122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692305#comment-16692305 ] Apache Spark commented on SPARK-26122: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/23091 > Support encoding for multiLine in CSV datasource > > > Key: SPARK-26122 > URL: https://issues.apache.org/jira/browse/SPARK-26122 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > Currently, CSV datasource is not able to read CSV files in different encoding > when multiLine is enabled. The ticket aims to support the encoding CSV > options in the mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26122) Support encoding for multiLine in CSV datasource
[ https://issues.apache.org/jira/browse/SPARK-26122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26122: Assignee: Apache Spark > Support encoding for multiLine in CSV datasource > > > Key: SPARK-26122 > URL: https://issues.apache.org/jira/browse/SPARK-26122 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > Currently, CSV datasource is not able to read CSV files in different encoding > when multiLine is enabled. The ticket aims to support the encoding CSV > options in the mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26122) Support encoding for multiLine in CSV datasource
[ https://issues.apache.org/jira/browse/SPARK-26122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26122: Assignee: (was: Apache Spark) > Support encoding for multiLine in CSV datasource > > > Key: SPARK-26122 > URL: https://issues.apache.org/jira/browse/SPARK-26122 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > Currently, CSV datasource is not able to read CSV files in different encoding > when multiLine is enabled. The ticket aims to support the encoding CSV > options in the mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692355#comment-16692355 ] Imran Rashid commented on SPARK-26019: -- [~hyukjin.kwon] you think maybe these two lines need to be swapped? https://github.com/apache/spark/blob/master/python/pyspark/accumulators.py#L274-L275 Any chance you can try with that change, [~Tagar], since you've got a handy environment to reproduce? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692377#comment-16692377 ] Ruslan Dautkhanov commented on SPARK-26019: --- [~irashid] thanks a lot for looking at this! It makes sense to swap those two lines so the parent class constructor is called after auth_token has been initialized. We're using Cloudera's Spark, and pyspark dependencies are inside a zip file, in an "immutable" parcel... Unfortunately there is no quick way to test it, as it has to be propagated to all worker nodes. [~bersprockets] any ideas how to test this? Thank you. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26094) Streaming WAL should create parent dirs
[ https://issues.apache.org/jira/browse/SPARK-26094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692380#comment-16692380 ] Apache Spark commented on SPARK-26094: -- User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/23092 > Streaming WAL should create parent dirs > --- > > Key: SPARK-26094 > URL: https://issues.apache.org/jira/browse/SPARK-26094 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Blocker > > SPARK-25871 introduced a regression in the streaming WAL -- it no longer > makes all the parent dirs, so you may see an exception like this in cases > that used to work: > {noformat} > 18/11/09 03:31:48 ERROR util.FileBasedWriteAheadLog_ReceiverSupervisorImpl: > Failed to write to write ahead log after 3 failures > ... > org.apache.spark.SparkException: Exception thrown in awaitResult: > at > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:210) > ... > Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: > /tmp/__spark__1e8ba184-d323-47eb-b857-0e6285409424/88992/checkpoints/receivedData/0 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyParentDir(FSDirectory.java:1923) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
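The failure shape in the FileNotFoundException above can be illustrated in plain Python (a toy, not Spark's Scala WAL code; the directory layout below is illustrative): opening a log segment under a nested checkpoint path fails unless the parent directories exist, so the writer must create them first.

```python
import os
import tempfile

# A WAL-like segment path several levels below a fresh checkpoint root.
root = tempfile.mkdtemp()
log_path = os.path.join(root, "checkpoints", "receivedData", "0", "log-0")

try:
    open(log_path, "wb")  # fails: the parent directories were never created
except FileNotFoundError as exc:
    print("without mkdirs:", exc)

# The missing step the regression dropped: create all parent dirs first.
os.makedirs(os.path.dirname(log_path), exist_ok=True)
with open(log_path, "wb") as f:
    f.write(b"record")

print("log written:", os.path.exists(log_path))  # True
```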
[jira] [Assigned] (SPARK-26094) Streaming WAL should create parent dirs
[ https://issues.apache.org/jira/browse/SPARK-26094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26094: Assignee: Imran Rashid (was: Apache Spark) > Streaming WAL should create parent dirs > --- > > Key: SPARK-26094 > URL: https://issues.apache.org/jira/browse/SPARK-26094 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Blocker > > SPARK-25871 introduced a regression in the streaming WAL -- it no longer > makes all the parent dirs, so you may see an exception like this in cases > that used to work: > {noformat} > 18/11/09 03:31:48 ERROR util.FileBasedWriteAheadLog_ReceiverSupervisorImpl: > Failed to write to write ahead log after 3 failures > ... > org.apache.spark.SparkException: Exception thrown in awaitResult: > at > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:210) > ... > Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: > /tmp/__spark__1e8ba184-d323-47eb-b857-0e6285409424/88992/checkpoints/receivedData/0 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyParentDir(FSDirectory.java:1923) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26094) Streaming WAL should create parent dirs
[ https://issues.apache.org/jira/browse/SPARK-26094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26094: Assignee: Apache Spark (was: Imran Rashid) > Streaming WAL should create parent dirs > --- > > Key: SPARK-26094 > URL: https://issues.apache.org/jira/browse/SPARK-26094 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Assignee: Apache Spark >Priority: Blocker > > SPARK-25871 introduced a regression in the streaming WAL -- it no longer > makes all the parent dirs, so you may see an exception like this in cases > that used to work: > {noformat} > 18/11/09 03:31:48 ERROR util.FileBasedWriteAheadLog_ReceiverSupervisorImpl: > Failed to write to write ahead log after 3 failures > ... > org.apache.spark.SparkException: Exception thrown in awaitResult: > at > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:210) > ... > Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: > /tmp/__spark__1e8ba184-d323-47eb-b857-0e6285409424/88992/checkpoints/receivedData/0 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyParentDir(FSDirectory.java:1923) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26123) Make all MLlib/ML Models public
Nicholas Resnick created SPARK-26123: Summary: Make all MLlib/ML Models public Key: SPARK-26123 URL: https://issues.apache.org/jira/browse/SPARK-26123 Project: Spark Issue Type: Wish Components: ML, MLlib Affects Versions: 2.4.0 Reporter: Nicholas Resnick Can all Model subclasses be made public? It's very difficult to make custom Models that build off of MLlib/ML Models otherwise. I also can't think of a reason why Estimators should be public, but not Models. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692495#comment-16692495 ] Hyukjin Kwon commented on SPARK-26019: -- I don't understand how swapping the two lines solves this issue. Also, I don't get how it happens or how to reproduce it. [~Tagar], it's you who reported the issue, and it looks odd to ask an arbitrary guy in Apache JIRA for an idea on how to test it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26019.
--
Resolution: Cannot Reproduce
[jira] [Updated] (SPARK-25919) Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned
[ https://issues.apache.org/jira/browse/SPARK-25919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pawan updated SPARK-25919:
--
Affects Version/s: 2.1.0

> Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned
>
> Key: SPARK-25919
> URL: https://issues.apache.org/jira/browse/SPARK-25919
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, Spark Shell, Spark Submit
> Affects Versions: 2.1.0, 2.2.1
> Reporter: Pawan
> Priority: Major
>
> Hi
> I found a really strange issue. Below are the steps to reproduce it. The issue occurs only when the table row format is ParquetHiveSerDe and the target table is partitioned.
> *Hive:*
> Log in to the hive terminal on the cluster and create the tables below.
> {code:java}
> create table t_src(
>   name varchar(10),
>   dob timestamp
> )
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
>
> create table t_tgt(
>   name varchar(10),
>   dob timestamp
> )
> PARTITIONED BY (city varchar(10))
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
> {code}
> Insert data into the source table (t_src):
> {code:java}
> INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 00:00:00.0');
> {code}
> *Spark-shell:*
> Open spark-shell and execute the commands below:
> {code:java}
> import org.apache.spark.sql.hive.HiveContext
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> val q0 = "TRUNCATE table t_tgt"
> val q1 = "SELECT CAST(alias.name AS STRING) as a0, alias.dob as a1 FROM DEFAULT.t_src alias"
> val q2 = "INSERT INTO TABLE DEFAULT.t_tgt PARTITION (city) SELECT tbl0.a0 as c0, tbl0.a1 as c1, NULL as c2 FROM tbl0"
> sqlContext.sql(q0)
> sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0")
> sqlContext.sql(q2)
> {code}
> After this, check the contents of the target table t_tgt. You will see the date "0001-01-01 00:00:00" changed to "0002-01-01 00:00:00". The snippets below show the contents of both tables:
> {code:java}
> select * from t_src;
> +-------------+------------------------+
> | t_src.name  | t_src.dob              |
> +-------------+------------------------+
> | p1          | 0001-01-01 00:00:00.0  |
> | p2          | 0002-01-01 00:00:00.0  |
> | p3          | 0003-01-01 00:00:00.0  |
> | p4          | 0004-01-01 00:00:00.0  |
> +-------------+------------------------+
>
> select * from t_tgt;
> +-------------+------------------------+-------------+
> | t_src.name  | t_src.dob              | t_tgt.city  |
> +-------------+------------------------+-------------+
> | p1          | 0002-01-01 00:00:00.0  | __HIVE_DEF  |
> | p2          | 0002-01-01 00:00:00.0  | __HIVE_DEF  |
> | p3          | 0003-01-01 00:00:00.0  | __HIVE_DEF  |
> | p4          | 0004-01-01 00:00:00.0  | __HIVE_DEF  |
> +-------------+------------------------+-------------+
> {code}
>
> Is this a known issue? Is it fixed in any subsequent releases?
> Thanks & regards,
> Pawan Lawale
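For illustration only, the symptom in the tables above can be pinned down with a plain-Python comparison of the (name, dob) pairs between source and target; the dictionaries simply copy the values from the query output, and no Spark or Hive is involved.

```python
# Values transcribed from the t_src / t_tgt query output above.
src = {"p1": "0001-01-01 00:00:00.0", "p2": "0002-01-01 00:00:00.0",
       "p3": "0003-01-01 00:00:00.0", "p4": "0004-01-01 00:00:00.0"}
tgt = {"p1": "0002-01-01 00:00:00.0", "p2": "0002-01-01 00:00:00.0",
       "p3": "0003-01-01 00:00:00.0", "p4": "0004-01-01 00:00:00.0"}

# Rows whose dob changed between source and target.
corrupted = {name for name in src if src[name] != tgt[name]}
print(corrupted)  # {'p1'}
```

Only the year-0001 row is affected, which suggests the corruption is specific to that boundary value rather than a general off-by-one on all timestamps.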
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692498#comment-16692498 ] Hyukjin Kwon commented on SPARK-26019:
--
Please reopen this if it can still be reproduced, or if there is any analysis of why it happens. Swapping the two lines should be fine, but I don't understand why or how it fixes the problem.
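Under the assumption that the "swap two lines" change is about assignment order (the class and attribute names below are hypothetical, not Spark's), a minimal sketch shows what the reordering would guarantee: if the token is assigned before the handler thread is started, the handler can never observe None.

```python
import threading

class FixedServer:
    """Sketch of the assignment-before-start ordering under discussion."""
    def __init__(self, token):
        self.auth_token = token              # assigned first ...
        self._thread = threading.Thread(target=self._handle)
        self._thread.start()                 # ... handler started only afterwards
        self._thread.join()

    def _handle(self):
        # auth_token is guaranteed to be set before this thread exists,
        # so len() is always safe here.
        self.token_length = len(self.auth_token)

srv = FixedServer("secret")
print(srv.token_length)  # 6
```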
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692504#comment-16692504 ] Liang-Chi Hsieh commented on SPARK-26019:
-
Doesn't the server begin handling requests only after {{serve_forever}} is called? And {{serve_forever}} is called after {{AccumulatorServer}} initialization.
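The point above can be checked with Python's standard socketserver module directly: the listening socket is bound in __init__, but request handlers only run once serve_forever() is entered, so an attribute assigned in between is visible to every handler. A self-contained demo (the handler and attribute names are illustrative, not Spark's):

```python
import socket
import socketserver
import threading

class TokenHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # Reads an attribute of the server object, the way the
        # accumulator handler reads auth_token.
        self.request.sendall(self.server.auth_token.encode())

# __init__ binds and listens, but runs no handlers yet.
srv = socketserver.ThreadingTCPServer(("127.0.0.1", 0), TokenHandler)
srv.auth_token = "secret"  # assigned before serve_forever() starts handling
threading.Thread(target=srv.serve_forever, daemon=True).start()

with socket.create_connection(srv.server_address) as conn:
    received = conn.recv(1024)
srv.shutdown()
srv.server_close()
print(received)  # b'secret'
```

If the reported trace is real, something must nevertheless allow a handler to run (or the token to be read) before the assignment, which is exactly the part of the report that remained unexplained.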