[jira] [Resolved] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26019. -- Resolution: Cannot Reproduce

Reopen this if the issue persists in Apache master, or if the analysis for it is very clear. I vaguely suspect it's somehow related to Cloudera's parcel. If the use case is strong enough (I guess so), and we're clear on what the cause is, switching the two lines should be fine.

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()"
> in authenticate_and_accum_updates()
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.3.2, 2.4.0
> Reporter: Ruslan Dautkhanov
> Priority: Major
>
> Started happening after the 2.3.1 -> 2.3.2 upgrade.
>
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
>
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 652, in __init__
>     self.handle()
>   File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 238, in poll
>     if func():
>   File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
> {code}
>
> The error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> The error seems flaky - on the next rerun it didn't happen.
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692778#comment-16692778 ] Hyukjin Kwon commented on SPARK-26019: -- Also, I followed the guide:
{quote}
For issues that can't be reproduced against master as reported, resolve as Cannot Reproduce
{quote}
You missed this.
[jira] [Comment Edited] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692773#comment-16692773 ] Hyukjin Kwon edited comment on SPARK-26019 at 11/20/18 7:29 AM: [~Tagar], can you reproduce this in Apache master? Also, we don't know why it causes the error. I also checked what [~viirya] checked, and it doesn't make sense that that's the cause.
was (Author: hyukjin.kwon): [~Tagar], can you reproduce this in Apache master?
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692773#comment-16692773 ] Hyukjin Kwon commented on SPARK-26019: -- [~Tagar], can you reproduce this in Apache master?
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692769#comment-16692769 ] Hyukjin Kwon commented on SPARK-26019: -- I can't reproduce the issue in my local environment or on a YARN cluster. That's why I resolved this JIRA. If it's a Cloudera distribution, please use a Cloudera channel.
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692722#comment-16692722 ] Liang-Chi Hsieh commented on SPARK-26019: - {{TCPServer}} begins to process requests only after we call its {{serve_forever}} method. When the server processes a request, it instantiates {{RequestHandlerClass}}: https://github.com/python/cpython/blob/2.7/Lib/SocketServer.py#L332 Since {{RequestHandlerClass}} in {{AccumulatorServer}} is {{_UpdateRequestHandler}}, at that moment {{_UpdateRequestHandler.handle}} is called, which accesses {{auth_token}}. All of the above should happen only after we call {{serve_forever}} in {{_start_update_server}}:
{code:python}
def _start_update_server(auth_token):
    """Start a TCP server to receive accumulator updates in a daemon thread, and returns it"""
    server = AccumulatorServer(("localhost", 0), _UpdateRequestHandler, auth_token)
    thread = threading.Thread(target=server.serve_forever)
{code}
And by then I think we have already finished instantiating {{AccumulatorServer}}, so {{self.auth_token = auth_token}} has run.
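To make the claimed ordering concrete, here is a minimal, self-contained sketch of the same handler lifecycle (written against Python 2.7's {{SocketServer}} to match the traceback; every name other than the stdlib ones is made up for illustration). A handler's {{handle}} only runs once {{serve_forever}} is accepting requests, and it sees whatever attributes were set on the server instance before then:
{code:python}
import SocketServer  # the module is named socketserver in Python 3
import socket
import threading

class TokenEchoHandler(SocketServer.StreamRequestHandler):
    def handle(self):
        # self.server is the TCPServer instance; an attribute set on it
        # before serve_forever() started processing is visible here.
        token = getattr(self.server, "auth_token", None)
        self.wfile.write(repr(token) + "\n")

server = SocketServer.TCPServer(("localhost", 0), TokenEchoHandler)
server.auth_token = "secret"  # assigned after __init__, before serving
thread = threading.Thread(target=server.serve_forever)
thread.daemon = True
thread.start()

client = socket.create_connection(server.server_address)
print(client.makefile().readline())  # prints 'secret', not None
{code}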
[jira] [Created] (SPARK-26126) Should put scala-library deps into root pom instead of spark-tags module
liupengcheng created SPARK-26126: Summary: Should put scala-library deps into root pom instead of spark-tags module Key: SPARK-26126 URL: https://issues.apache.org/jira/browse/SPARK-26126 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0, 2.3.0, 2.1.0 Reporter: liupengcheng

While doing some backports in our custom Spark, I noticed some strange code in the spark-tags module:
{code:xml}
<dependency>
  <groupId>org.scala-lang</groupId>
  <artifactId>scala-library</artifactId>
  <version>${scala.version}</version>
</dependency>
{code}
As I understand it, spark-tags should only contain annotation-related classes and dependencies. Should we move the scala-library dependency to the root pom?
[jira] [Reopened] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Dautkhanov reopened SPARK-26019: ---
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692716#comment-16692716 ] Ruslan Dautkhanov commented on SPARK-26019: --- [~viirya] the exception stack shows that the error happened in SocketServer.py, in the BaseRequestHandler class constructor; excerpt from the full exception stack above:
{code:python}
...
  File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 652, in __init__
    self.handle()
...
{code}
Notice the constructor here calls `self.handle()` - https://github.com/python/cpython/blob/2.7/Lib/SocketServer.py#L655
`handle()` is defined in the derived class _UpdateRequestHandler here: https://github.com/apache/spark/blob/master/python/pyspark/accumulators.py#L232 and expects `auth_token` to be set: https://github.com/apache/spark/blob/master/python/pyspark/accumulators.py#L254 - that's exactly where the exception happens. [~irashid] was right - those two lines have to be swapped.
[~hyukjin.kwon] it's odd that you closed this JIRA, although I said it always reproduces for me (100% of the time), and I even [posted a reproducer here|https://issues.apache.org/jira/secure/EditComment!default.jspa?id=13197858=16692219]. [~saivarunvishal] also said it happens for him in SPARK-26113, and you closed that JIRA as well. That seems not in line with https://spark.apache.org/contributing.html - "Contributing Bug Reports". Please let me know what I am missing here.
I called out [~bersprockets] because we use the Cloudera distribution of Spark, and Cloudera has a few patches on top of open-source Spark. I wanted to make sure it's not Cloudera-distro specific. Also, we worked with Bruce on several other Spark issues, and I noticed he's in the watchers list on this JIRA... Now I see that this issue is not Cloudera specific, though.
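For reference, a minimal sketch of the two-line swap being discussed (the constructor body below is a paraphrase of {{AccumulatorServer}} in pyspark/accumulators.py, not the verbatim source):
{code:python}
import SocketServer  # socketserver in Python 3

class AccumulatorServer(SocketServer.TCPServer):
    def __init__(self, server_address, RequestHandlerClass, auth_token):
        # Assign the token *before* invoking the parent constructor, so a
        # handler can never observe the server without auth_token set.
        # (The reported failure mode has these two statements reversed.)
        self.auth_token = auth_token
        SocketServer.TCPServer.__init__(self, server_address, RequestHandlerClass)
{code}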
[jira] [Created] (SPARK-26125) Delegation Token seems not appropriately stored on secrets of Kubernetes/Kerberized HDFS
Kei Kori created SPARK-26125: Summary: Delegation Token seems not appropriately stored on secrets of Kubernetes/Kerberized HDFS Key: SPARK-26125 URL: https://issues.apache.org/jira/browse/SPARK-26125 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.0.0 Reporter: Kei Kori Attachments: spark-submit-stern.log

I tried Kerberos authentication with the Kubernetes resource manager and an external Hadoop cluster and KDC. I tested a build of [6c9c84f|https://github.com/apache/spark/commit/6c9c84ffb9c8d98ee2ece7ba4b010856591d383d] (master + SPARK-23257).
{code}
$ bin/spark-submit \
    --deploy-mode cluster \
    --class org.apache.spark.examples.HdfsTest \
    --master k8s://https://master01.node:6443 \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.app.name=spark-hdfs \
    --conf spark.executer.instances=1 \
    --conf spark.kubernetes.container.image=docker-registry/kkori/spark:6c9c84f \
    --conf spark.kubernetes.kerberos.enabled=true \
    --conf spark.kubernetes.kerberos.krb5.configMapName=krb5-conf \
    --conf spark.kubernetes.kerberos.keytab=/tmp/test.keytab \
    --conf spark.kubernetes.kerberos.principal=t...@external.kerberos.realm.com \
    --conf spark.kubernetes.hadoop.configMapName=hadoop-conf \
    local:///opt/spark/examples/jars/spark-examples_2.11-3.0.0-SNAPSHOT.jar
{code}
The submission into the Kubernetes RM succeeded and Kubernetes spawned the spark-driver and executors, but the Hadoop delegation token seems to be stored incorrectly in the Kubernetes secret, since it contains only a header, as shown below:
{code}
$ kubectl get secrets spark-hdfs-1542613661459-delegation-tokens -o jsonpath='{.data.hadoop-tokens}' | {base64 -d | cat -A; echo;}
HDTS^@^@^@
{code}
The result of "kubectl get secrets" should look like the following (I masked the actual result):
{code}
HDTS^@^ha-hdfs:test^@^_t...@external.kerberos.realm.com^@^@
{code}
As a result, spark-driver threw a GSSException on each access to HDFS. Full logs (submit, driver, executor) are attached.
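As a quick way to triage reports like this, here is a hedged sketch (the secret and field names are taken from the report; the helper itself is hypothetical, not part of Spark or Kubernetes) that fetches the secret and checks whether anything beyond the 4-byte "HDTS" token-storage magic was actually serialized:
{code:python}
import base64
import subprocess
import sys

def check_token_secret(secret_name, field="hadoop-tokens"):
    # Hypothetical helper: pull the base64 payload out of the secret the
    # same way the report does, then sanity-check the decoded blob.
    raw = subprocess.check_output([
        "kubectl", "get", "secret", secret_name,
        "-o", "jsonpath={.data." + field + "}",
    ])
    blob = base64.b64decode(raw)
    if not blob.startswith(b"HDTS"):
        sys.exit("not a Hadoop token storage blob")
    # A healthy secret carries token entries after the magic; the secret
    # in the report has only a few zero bytes there.
    print("payload bytes after magic:", len(blob) - 4)

check_token_secret("spark-hdfs-1542613661459-delegation-tokens")
{code}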
[jira] [Updated] (SPARK-26125) Delegation Token seems not appropriately stored on secrets of Kubernetes/Kerberized HDFS
[ https://issues.apache.org/jira/browse/SPARK-26125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kei Kori updated SPARK-26125: - Attachment: spark-submit-stern.log
[jira] [Commented] (SPARK-25959) Difference in featureImportances results on computed vs saved models
[ https://issues.apache.org/jira/browse/SPARK-25959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692567#comment-16692567 ] Dagang Wei commented on SPARK-25959: Can we backport the fix "[SPARK-25959][ML] GBTClassifier picks wrong impurity stats on loading" (e00cac9) to Spark 2.2+? I tried to cherry-pick it onto 2.2, but there are 2 conflicts I don't know how to resolve correctly:
both modified: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
both modified: project/MimaExcludes.scala

> Difference in featureImportances results on computed vs saved models
>
> Key: SPARK-25959
> URL: https://issues.apache.org/jira/browse/SPARK-25959
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 2.2.0
> Reporter: Suraj Nayak
> Assignee: Marco Gaido
> Priority: Major
> Fix For: 3.0.0
>
> I trained a GBT model and found that the feature importances computed when the model was fit differ once the same model is saved to storage and loaded back.
> I also found that once the persisted model is loaded, saved back again, and loaded again, the feature importances remain the same.
> Not sure if it's a bug while storing and reading the model the first time, or if I am missing some parameter that needs to be set before saving the model (thus the model is picking some defaults - causing the feature importances to change).
> *Below is the test code:*
> val testDF = Seq(
>   (1, 3, 2, 1, 1),
>   (3, 2, 1, 2, 0),
>   (2, 2, 1, 1, 0),
>   (3, 4, 2, 2, 0),
>   (2, 2, 1, 3, 1)
> ).toDF("a", "b", "c", "d", "e")
> val featureColumns = testDF.columns.filter(_ != "e")
> // Assemble the features into a vector
> val assembler = new VectorAssembler().setInputCols(featureColumns).setOutputCol("features")
> // Transform the data to get the feature data set
> val featureDF = assembler.transform(testDF)
> // Train a GBT model.
> val gbt = new GBTClassifier()
>   .setLabelCol("e")
>   .setFeaturesCol("features")
>   .setMaxDepth(2)
>   .setMaxBins(5)
>   .setMaxIter(10)
>   .setSeed(10)
>   .fit(featureDF)
> gbt.transform(featureDF).show(false)
> // Write out the model
> featureColumns.zip(gbt.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* Prints
> (d,0.5931875075767403)
> (a,0.3747184548362353)
> (b,0.03209403758702444)
> (c,0.0)
> */
> gbt.write.overwrite().save("file:///tmp/test123")
> println("Reading model again")
> val gbtload = GBTClassificationModel.load("file:///tmp/test123")
> featureColumns.zip(gbtload.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* Prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */
> gbtload.write.overwrite().save("file:///tmp/test123_rewrite")
> val gbtload2 = GBTClassificationModel.load("file:///tmp/test123_rewrite")
> featureColumns.zip(gbtload2.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* Prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */
[jira] [Commented] (SPARK-26077) Reserved SQL words are not escaped by JDBC writer for table name
[ https://issues.apache.org/jira/browse/SPARK-26077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692546#comment-16692546 ] Takeshi Yamamuro commented on SPARK-26077: -- Thanks for your report! It seems master doesn't quote a JDBC table name now. Are you interested in contributing a fix? I think this is a kind of starter issue, maybe...

> Reserved SQL words are not escaped by JDBC writer for table name
>
> Key: SPARK-26077
> URL: https://issues.apache.org/jira/browse/SPARK-26077
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.2
> Reporter: Eugene Golovan
> Priority: Major
>
> This bug is similar to SPARK-16387, but this time the table name is not escaped.
> How to reproduce:
> 1/ Start spark-shell with the MySQL connector:
> spark-shell --jars ./mysql-connector-java-8.0.13.jar
> 2/ Execute this code, where condition is a reserved word:
> import spark.implicits._
> (spark
>   .createDataset(Seq("a","b","c"))
>   .toDF("order")
>   .write
>   .format("jdbc")
>   .option("url", s"jdbc:mysql://root@localhost:3306/test")
>   .option("driver", "com.mysql.cj.jdbc.Driver")
>   .option("dbtable", "condition")
>   .save)
> Error message:
> java.sql.SQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'condition (`order` TEXT )' at line 1
> at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:120)
> at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97)
> at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
> at com.mysql.cj.jdbc.StatementImpl.executeUpdateInternal(StatementImpl.java:1355)
> at com.mysql.cj.jdbc.StatementImpl.executeLargeUpdate(StatementImpl.java:2128)
> at com.mysql.cj.jdbc.StatementImpl.executeUpdate(StatementImpl.java:1264)
> at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:844)
> at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
> at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
> at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
> at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
> ... 59 elided
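Until the writer quotes identifiers itself, one hedged workaround (a PySpark sketch of the same pipeline; the URL and driver are the reporter's, and I have not verified this against every dialect) is to quote the reserved identifier in the {{dbtable}} option yourself:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Same pipeline as the Scala repro; note the backticks (MySQL's identifier
# quote character) around the reserved word in dbtable, which end up
# verbatim in the generated CREATE TABLE / INSERT statements.
(spark.createDataFrame([("a",), ("b",), ("c",)], ["order"])
    .write
    .format("jdbc")
    .option("url", "jdbc:mysql://root@localhost:3306/test")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "`condition`")
    .save())
{code}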
[jira] [Assigned] (SPARK-26124) Update plugins, including MiMa
[ https://issues.apache.org/jira/browse/SPARK-26124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26124: Assignee: Apache Spark (was: Sean Owen)
[jira] [Assigned] (SPARK-26124) Update plugins, including MiMa
[ https://issues.apache.org/jira/browse/SPARK-26124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26124: Assignee: Sean Owen (was: Apache Spark)
[jira] [Commented] (SPARK-26124) Update plugins, including MiMa
[ https://issues.apache.org/jira/browse/SPARK-26124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692539#comment-16692539 ] Apache Spark commented on SPARK-26124: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/23087
[jira] [Created] (SPARK-26124) Update plugins, including MiMa
Sean Owen created SPARK-26124: - Summary: Update plugins, including MiMa Key: SPARK-26124 URL: https://issues.apache.org/jira/browse/SPARK-26124 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0 Reporter: Sean Owen Assignee: Sean Owen

For Spark 3, we should update plugins to their latest version where possible, to pick up miscellaneous fixes. In particular we can update MiMa to 0.3.0, though that introduces some new errors on old changes due to some changes in MiMa. Most SBT plugins can't really be updated further without updating to SBT 1.x, and that will require some changes to the build, and it generally seems like all of these new versions are for Scala 2.12+, including the new zinc. That will probably be a bigger change but only after deciding to drop Scala 2.11 support.
[jira] [Resolved] (SPARK-26107) Extend ReplaceNullWithFalseInPredicate to support higher-order functions: ArrayExists, ArrayFilter, MapFilter
[ https://issues.apache.org/jira/browse/SPARK-26107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26107. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23079 [https://github.com/apache/spark/pull/23079]

> Extend ReplaceNullWithFalseInPredicate to support higher-order functions: ArrayExists, ArrayFilter, MapFilter
>
> Key: SPARK-26107
> URL: https://issues.apache.org/jira/browse/SPARK-26107
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Kris Mok
> Assignee: Kris Mok
> Priority: Major
> Fix For: 3.0.0
>
> Extend the {{ReplaceNullWithFalse}} optimizer rule introduced in SPARK-25860 to also support optimizing predicates in the higher-order functions {{ArrayExists}}, {{ArrayFilter}}, and {{MapFilter}}.
> Also rename the rule to {{ReplaceNullWithFalseInPredicate}} to better reflect its intent.
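For context, a hedged illustration of what the rule exploits (my own example, not from the ticket): in a predicate position a NULL result filters a row out exactly like FALSE does, so the optimizer may rewrite a literal NULL to FALSE, and this ticket extends that rewrite into the lambdas of higher-order functions:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (-1,)], ["a"])

# In a WHERE clause, a NULL predicate result drops the row just like
# FALSE, so NULL -> FALSE is a safe rewrite that simplifies the plan.
df.where("IF(a > 0, TRUE, NULL)").show()  # keeps only the a = 1 row

# The ticket applies the same reasoning inside exists()/filter() lambdas:
spark.sql("SELECT exists(array(1, -1), x -> IF(x > 0, TRUE, NULL))").show()
{code}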
[jira] [Assigned] (SPARK-26107) Extend ReplaceNullWithFalseInPredicate to support higher-order functions: ArrayExists, ArrayFilter, MapFilter
[ https://issues.apache.org/jira/browse/SPARK-26107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26107: --- Assignee: Kris Mok
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692504#comment-16692504 ] Liang-Chi Hsieh commented on SPARK-26019: - Doesn't the server begin to handle requests only after {{serve_forever}} is called? And that is called after {{AccumulatorServer}} initialization.
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692498#comment-16692498 ] Hyukjin Kwon commented on SPARK-26019: -- Please reopen if this is still being reproduced, or if there's an analysis of why this happens. Swapping the two lines should be fine, but I don't understand why and how.
[jira] [Updated] (SPARK-25919) Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned
[ https://issues.apache.org/jira/browse/SPARK-25919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pawan updated SPARK-25919: -- Affects Version/s: 2.1.0

> Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned
>
> Key: SPARK-25919
> URL: https://issues.apache.org/jira/browse/SPARK-25919
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, Spark Shell, Spark Submit
> Affects Versions: 2.1.0, 2.2.1
> Reporter: Pawan
> Priority: Major
>
> Hi,
> I found a really strange issue. Below are the steps to reproduce it. The issue occurs only when the table row format is ParquetHiveSerDe and the target table is partitioned.
> *Hive:*
> Log in to a hive terminal on the cluster and create the tables below.
> {code:java}
> create table t_src(
>   name varchar(10),
>   dob timestamp
> )
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
>
> create table t_tgt(
>   name varchar(10),
>   dob timestamp
> )
> PARTITIONED BY (city varchar(10))
> ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
> {code}
> Insert data into the source table (t_src):
> {code:java}
> INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 00:00:00.0');
> {code}
> *Spark-shell:*
> Get on to spark-shell and execute the commands below:
> {code:java}
> import org.apache.spark.sql.hive.HiveContext
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> val q0 = "TRUNCATE table t_tgt"
> val q1 = "SELECT CAST(alias.name AS STRING) as a0, alias.dob as a1 FROM DEFAULT.t_src alias"
> val q2 = "INSERT INTO TABLE DEFAULT.t_tgt PARTITION (city) SELECT tbl0.a0 as c0, tbl0.a1 as c1, NULL as c2 FROM tbl0"
> sqlContext.sql(q0)
> sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0")
> sqlContext.sql(q2)
> {code}
> After this, check the contents of the target table t_tgt. You will see the date "0001-01-01 00:00:00" changed to "0002-01-01 00:00:00". The snippets below show the contents of both tables:
> select * from t_src;
> +-------------+------------------------+
> | t_src.name  | t_src.dob              |
> +-------------+------------------------+
> | p1          | 0001-01-01 00:00:00.0  |
> | p2          | 0002-01-01 00:00:00.0  |
> | p3          | 0003-01-01 00:00:00.0  |
> | p4          | 0004-01-01 00:00:00.0  |
> +-------------+------------------------+
> select * from t_tgt;
> +-------------+------------------------+-------------+
> | t_src.name  | t_src.dob              | t_tgt.city  |
> +-------------+------------------------+-------------+
> | p1          | 0002-01-01 00:00:00.0  | __HIVE_DEF  |
> | p2          | 0002-01-01 00:00:00.0  | __HIVE_DEF  |
> | p3          | 0003-01-01 00:00:00.0  | __HIVE_DEF  |
> | p4          | 0004-01-01 00:00:00.0  | __HIVE_DEF  |
> +-------------+------------------------+-------------+
> Is this a known issue? Is it fixed in any subsequent releases?
> Thanks & regards,
> Pawan Lawale
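A quick hedged check for this report (a PySpark sketch I added for illustration; it assumes the two tables above exist in the current metastore) that flags rows whose timestamp changed during the partitioned insert:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Join source and target on name and keep rows whose dob no longer
# matches; any output reproduces the corruption described above.
src = spark.table("t_src").selectExpr("name", "dob AS src_dob")
tgt = spark.table("t_tgt").selectExpr("name", "dob AS tgt_dob")
src.join(tgt, "name").where("src_dob <> tgt_dob").show(truncate=False)
{code}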
[jira] [Resolved] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26019. -- Resolution: Cannot Reproduce
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692495#comment-16692495 ] Hyukjin Kwon commented on SPARK-26019: -- I don't understand how swapping the two lines solves this issue. Also, I don't get how it happens and how to reproduce this. [~Tagar], it's you who reported the issue, and it looks odd to ask an arbitrary guy for an idea on how to test it in Apache JIRA. > pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" > in authenticate_and_accum_updates() > > > Key: SPARK-26019 > URL: https://issues.apache.org/jira/browse/SPARK-26019 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2, 2.4.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > Started happening after 2.3.1 -> 2.3.2 upgrade. > > {code:python} > Exception happened during processing of request from ('127.0.0.1', 43418) > > Traceback (most recent call last): > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 290, in _handle_request_noblock > self.process_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 318, in process_request > self.finish_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 331, in finish_request > self.RequestHandlerClass(request, client_address, self) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 652, in __init__ > self.handle() > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 263, in handle > poll(authenticate_and_accum_updates) > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 238, in poll > if func(): > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 251, in authenticate_and_accum_updates > received_token = self.rfile.read(len(auth_token)) > TypeError: object of type 'NoneType' has no len() > > {code} > > Error happens here: > https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254 > The PySpark code was just running a simple pipeline of > binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. ) > and then converting it to a dataframe and running a count on it. > It seems error is flaky - on next rerun it didn't happen. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26123) Make all MLlib/ML Models public
Nicholas Resnick created SPARK-26123: Summary: Make all MLlib/ML Models public Key: SPARK-26123 URL: https://issues.apache.org/jira/browse/SPARK-26123 Project: Spark Issue Type: Wish Components: ML, MLlib Affects Versions: 2.4.0 Reporter: Nicholas Resnick Can all Model subclasses be made public? It's very difficult to make custom Models that build off of MLlib/ML Models otherwise. I also can't think of a reason why Estimators should be public, but not Models. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26094) Streaming WAL should create parent dirs
[ https://issues.apache.org/jira/browse/SPARK-26094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692382#comment-16692382 ] Apache Spark commented on SPARK-26094: -- User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/23092 > Streaming WAL should create parent dirs > --- > > Key: SPARK-26094 > URL: https://issues.apache.org/jira/browse/SPARK-26094 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Blocker > > SPARK-25871 introduced a regression in the streaming WAL -- it no longer > makes all the parent dirs, so you may see an exception like this in cases > that used to work: > {noformat} > 18/11/09 03:31:48 ERROR util.FileBasedWriteAheadLog_ReceiverSupervisorImpl: > Failed to write to write ahead log after 3 failures > ... > org.apache.spark.SparkException: Exception thrown in awaitResult: > at > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:210) > ... > Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: > /tmp/__spark__1e8ba184-d323-47eb-b857-0e6285409424/88992/checkpoints/receivedData/0 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyParentDir(FSDirectory.java:1923) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26094) Streaming WAL should create parent dirs
[ https://issues.apache.org/jira/browse/SPARK-26094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26094: Assignee: Apache Spark (was: Imran Rashid) > Streaming WAL should create parent dirs > --- > > Key: SPARK-26094 > URL: https://issues.apache.org/jira/browse/SPARK-26094 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Assignee: Apache Spark >Priority: Blocker > > SPARK-25871 introduced a regression in the streaming WAL -- it no longer > makes all the parent dirs, so you may see an exception like this in cases > that used to work: > {noformat} > 18/11/09 03:31:48 ERROR util.FileBasedWriteAheadLog_ReceiverSupervisorImpl: > Failed to write to write ahead log after 3 failures > ... > org.apache.spark.SparkException: Exception thrown in awaitResult: > at > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:210) > ... > Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: > /tmp/__spark__1e8ba184-d323-47eb-b857-0e6285409424/88992/checkpoints/receivedData/0 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyParentDir(FSDirectory.java:1923) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26094) Streaming WAL should create parent dirs
[ https://issues.apache.org/jira/browse/SPARK-26094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692380#comment-16692380 ] Apache Spark commented on SPARK-26094: -- User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/23092 > Streaming WAL should create parent dirs > --- > > Key: SPARK-26094 > URL: https://issues.apache.org/jira/browse/SPARK-26094 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Blocker > > SPARK-25871 introduced a regression in the streaming WAL -- it no longer > makes all the parent dirs, so you may see an exception like this in cases > that used to work: > {noformat} > 18/11/09 03:31:48 ERROR util.FileBasedWriteAheadLog_ReceiverSupervisorImpl: > Failed to write to write ahead log after 3 failures > ... > org.apache.spark.SparkException: Exception thrown in awaitResult: > at > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:210) > ... > Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: > /tmp/__spark__1e8ba184-d323-47eb-b857-0e6285409424/88992/checkpoints/receivedData/0 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyParentDir(FSDirectory.java:1923) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26094) Streaming WAL should create parent dirs
[ https://issues.apache.org/jira/browse/SPARK-26094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26094: Assignee: Imran Rashid (was: Apache Spark) > Streaming WAL should create parent dirs > --- > > Key: SPARK-26094 > URL: https://issues.apache.org/jira/browse/SPARK-26094 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Blocker > > SPARK-25871 introduced a regression in the streaming WAL -- it no longer > makes all the parent dirs, so you may see an exception like this in cases > that used to work: > {noformat} > 18/11/09 03:31:48 ERROR util.FileBasedWriteAheadLog_ReceiverSupervisorImpl: > Failed to write to write ahead log after 3 failures > ... > org.apache.spark.SparkException: Exception thrown in awaitResult: > at > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226) > at > org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:210) > ... > Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: > /tmp/__spark__1e8ba184-d323-47eb-b857-0e6285409424/88992/checkpoints/receivedData/0 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyParentDir(FSDirectory.java:1923) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
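For context on the regression above: SPARK-25871 dropped the implicit creation of parent directories, so the WAL writer can try to create a file under a path that does not exist yet. The snippet below is only an illustrative Python analogy of the ordering the linked PR presumably restores (the real patch is Scala against Hadoop's FileSystem API); the function name is made up.

{code:python}
import errno
import os

def create_wal_file(path):
    # Analogy only: create any missing parent directories first, so that
    # creating the log file never fails with "Parent directory doesn't exist".
    parent = os.path.dirname(path)
    if parent:
        try:
            os.makedirs(parent)
        except OSError as e:
            # The directory may already exist (possibly created concurrently).
            if e.errno != errno.EEXIST:
                raise
    return open(path, "ab")
{code}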
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692377#comment-16692377 ] Ruslan Dautkhanov commented on SPARK-26019: --- [~irashid] thanks a lot for looking at this! It makes sense to swap those two lines so that the parent class constructor is called after auth_token has been initialized. We're using Cloudera's Spark, and the pyspark dependencies are inside a zip file, in an "immutable" parcel... Unfortunately there is no quick way to test it, as it has to be propagated to all worker nodes. [~bersprockets] any ideas on how to test this? Thank you. > pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" > in authenticate_and_accum_updates() > > > Key: SPARK-26019 > URL: https://issues.apache.org/jira/browse/SPARK-26019 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2, 2.4.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > Started happening after 2.3.1 -> 2.3.2 upgrade. > > {code:python} > Exception happened during processing of request from ('127.0.0.1', 43418) > > Traceback (most recent call last): > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 290, in _handle_request_noblock > self.process_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 318, in process_request > self.finish_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 331, in finish_request > self.RequestHandlerClass(request, client_address, self) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 652, in __init__ > self.handle() > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 263, in handle > poll(authenticate_and_accum_updates) > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 238, in poll > if func(): > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 251, in authenticate_and_accum_updates > received_token = self.rfile.read(len(auth_token)) > TypeError: object of type 'NoneType' has no len() > > {code} > > Error happens here: > https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254 > The PySpark code was just running a simple pipeline of > binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. ) > and then converting it to a dataframe and running a count on it. > It seems error is flaky - on next rerun it didn't happen. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692355#comment-16692355 ] Imran Rashid commented on SPARK-26019: -- [~hyukjin.kwon] you think maybe these two lines need to be swapped? https://github.com/apache/spark/blob/master/python/pyspark/accumulators.py#L274-L275 Any chance you can try with that change, [~Tagar], since you've got a handy environment to reproduce? > pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" > in authenticate_and_accum_updates() > > > Key: SPARK-26019 > URL: https://issues.apache.org/jira/browse/SPARK-26019 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2, 2.4.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > Started happening after 2.3.1 -> 2.3.2 upgrade. > > {code:python} > Exception happened during processing of request from ('127.0.0.1', 43418) > > Traceback (most recent call last): > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 290, in _handle_request_noblock > self.process_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 318, in process_request > self.finish_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 331, in finish_request > self.RequestHandlerClass(request, client_address, self) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 652, in __init__ > self.handle() > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 263, in handle > poll(authenticate_and_accum_updates) > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 238, in poll > if func(): > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 251, in authenticate_and_accum_updates > received_token = self.rfile.read(len(auth_token)) > TypeError: object of type 'NoneType' has no len() > > {code} > > Error happens here: > https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254 > The PySpark code was just running a simple pipeline of > binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. ) > and then converting it to a dataframe and running a count on it. > It seems error is flaky - on next rerun it didn't happen. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
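To make the suggestion above concrete: the two referenced lines are the body of the accumulator server's constructor. A minimal sketch follows, written in Python 2 to match the tracebacks in this thread; the class and parameter names approximate the linked file and are not a verified excerpt.

{code:python}
import SocketServer  # Python 2 stdlib module, as in the tracebacks above

class AccumulatorServer(SocketServer.TCPServer):
    def __init__(self, server_address, RequestHandlerClass, auth_token):
        # Order under discussion: the parent constructor runs first (binding
        # the socket), and only afterwards is the token stored, so a request
        # handler could in principle look for self.auth_token before it is
        # assigned. The proposed fix is simply to swap these two statements:
        #     self.auth_token = auth_token
        #     SocketServer.TCPServer.__init__(self, server_address, RequestHandlerClass)
        SocketServer.TCPServer.__init__(self, server_address, RequestHandlerClass)
        self.auth_token = auth_token
{code}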
[jira] [Assigned] (SPARK-26122) Support encoding for multiLine in CSV datasource
[ https://issues.apache.org/jira/browse/SPARK-26122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26122: Assignee: (was: Apache Spark) > Support encoding for multiLine in CSV datasource > > > Key: SPARK-26122 > URL: https://issues.apache.org/jira/browse/SPARK-26122 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > Currently, CSV datasource is not able to read CSV files in different encoding > when multiLine is enabled. The ticket aims to support the encoding CSV > options in the mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26122) Support encoding for multiLine in CSV datasource
[ https://issues.apache.org/jira/browse/SPARK-26122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692306#comment-16692306 ] Apache Spark commented on SPARK-26122: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/23091 > Support encoding for multiLine in CSV datasource > > > Key: SPARK-26122 > URL: https://issues.apache.org/jira/browse/SPARK-26122 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > Currently, CSV datasource is not able to read CSV files in different encoding > when multiLine is enabled. The ticket aims to support the encoding CSV > options in the mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26122) Support encoding for multiLine in CSV datasource
[ https://issues.apache.org/jira/browse/SPARK-26122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692305#comment-16692305 ] Apache Spark commented on SPARK-26122: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/23091 > Support encoding for multiLine in CSV datasource > > > Key: SPARK-26122 > URL: https://issues.apache.org/jira/browse/SPARK-26122 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > Currently, CSV datasource is not able to read CSV files in different encoding > when multiLine is enabled. The ticket aims to support the encoding CSV > options in the mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26122) Support encoding for multiLine in CSV datasource
[ https://issues.apache.org/jira/browse/SPARK-26122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26122: Assignee: Apache Spark > Support encoding for multiLine in CSV datasource > > > Key: SPARK-26122 > URL: https://issues.apache.org/jira/browse/SPARK-26122 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > Currently, CSV datasource is not able to read CSV files in different encoding > when multiLine is enabled. The ticket aims to support the encoding CSV > options in the mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26122) Support encoding for multiLine in CSV datasource
Maxim Gekk created SPARK-26122: -- Summary: Support encoding for multiLine in CSV datasource Key: SPARK-26122 URL: https://issues.apache.org/jira/browse/SPARK-26122 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk Currently, the CSV datasource is not able to read CSV files in a different encoding when multiLine is enabled. This ticket aims to support the encoding CSV option in that mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
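A minimal PySpark sketch of the option combination this ticket targets; the path, charset, and header setting are made up for illustration, and before the fix the encoding option did not take effect when multiLine was enabled.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-encoding-demo").getOrCreate()

# Read a non-UTF-8 CSV file whose records span multiple lines. This is the
# combination that previously ignored the requested encoding.
df = (spark.read
      .option("header", True)
      .option("multiLine", True)
      .option("encoding", "ISO-8859-1")   # illustrative charset
      .csv("/tmp/latin1_records.csv"))    # hypothetical path
df.show()
{code}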
[jira] [Assigned] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark
[ https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26118: Assignee: Apache Spark > Make Jetty's requestHeaderSize configurable in Spark > > > Key: SPARK-26118 > URL: https://issues.apache.org/jira/browse/SPARK-26118 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Assignee: Apache Spark >Priority: Major > > For long authorization fields the request header size could be over the > default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request > Entity Too Large). > This issue may occur if the user is a member of many Active Directory user > groups. > The HTTP request to the server contains the Kerberos token in the > WWW-Authenticate header. The header size increases together with the number > of user groups. > Currently there is no way in Spark to override this limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark
[ https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692248#comment-16692248 ] Apache Spark commented on SPARK-26118: -- User 'attilapiros' has created a pull request for this issue: https://github.com/apache/spark/pull/23090 > Make Jetty's requestHeaderSize configurable in Spark > > > Key: SPARK-26118 > URL: https://issues.apache.org/jira/browse/SPARK-26118 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > For long authorization fields the request header size could be over the > default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request > Entity Too Large). > This issue may occur if the user is a member of many Active Directory user > groups. > The HTTP request to the server contains the Kerberos token in the > WWW-Authenticate header. The header size increases together with the number > of user groups. > Currently there is no way in Spark to override this limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark
[ https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692246#comment-16692246 ] Apache Spark commented on SPARK-26118: -- User 'attilapiros' has created a pull request for this issue: https://github.com/apache/spark/pull/23090 > Make Jetty's requestHeaderSize configurable in Spark > > > Key: SPARK-26118 > URL: https://issues.apache.org/jira/browse/SPARK-26118 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > For long authorization fields the request header size could be over the > default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request > Entity Too Large). > This issue may occur if the user is a member of many Active Directory user > groups. > The HTTP request to the server contains the Kerberos token in the > WWW-Authenticate header. The header size increases together with the number > of user groups. > Currently there is no way in Spark to override this limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark
[ https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26118: Assignee: (was: Apache Spark) > Make Jetty's requestHeaderSize configurable in Spark > > > Key: SPARK-26118 > URL: https://issues.apache.org/jira/browse/SPARK-26118 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > For long authorization fields the request header size could be over the > default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request > Entity Too Large). > This issue may occur if the user is a member of many Active Directory user > groups. > The HTTP request to the server contains the Kerberos token in the > WWW-Authenticate header. The header size increases together with the number > of user groups. > Currently there is no way in Spark to override this limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
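For reference, the knob that eventually shipped for this is the spark.ui.requestHeaderSize configuration; the snippet below is a usage sketch with an arbitrary value, not an excerpt from the linked PR.

{code:python}
from pyspark.sql import SparkSession

# Raise Jetty's request header limit above the 8192-byte default so that
# large Kerberos tokens from many AD group memberships no longer trigger
# HTTP 413 (Request Entity Too Large).
spark = (SparkSession.builder
         .appName("large-header-ui")
         .config("spark.ui.requestHeaderSize", "16k")  # value is illustrative
         .getOrCreate())
{code}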
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692237#comment-16692237 ] Ruslan Dautkhanov commented on SPARK-26019: --- cc [~lucacanali] > pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" > in authenticate_and_accum_updates() > > > Key: SPARK-26019 > URL: https://issues.apache.org/jira/browse/SPARK-26019 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2, 2.4.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > Started happening after 2.3.1 -> 2.3.2 upgrade. > > {code:python} > Exception happened during processing of request from ('127.0.0.1', 43418) > > Traceback (most recent call last): > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 290, in _handle_request_noblock > self.process_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 318, in process_request > self.finish_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 331, in finish_request > self.RequestHandlerClass(request, client_address, self) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 652, in __init__ > self.handle() > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 263, in handle > poll(authenticate_and_accum_updates) > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 238, in poll > if func(): > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 251, in authenticate_and_accum_updates > received_token = self.rfile.read(len(auth_token)) > TypeError: object of type 'NoneType' has no len() > > {code} > > Error happens here: > https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254 > The PySpark code was just running a simple pipeline of > binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. ) > and then converting it to a dataframe and running a count on it. > It seems error is flaky - on next rerun it didn't happen. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692233#comment-16692233 ] Ruslan Dautkhanov commented on SPARK-26019: --- Sorry, nope it was broken by this change - https://github.com/apache/spark/commit/15fc2372269159ea2556b028d4eb8860c4108650#diff-c3339bbf2b850b79445b41e9eecf57c4R249 > pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" > in authenticate_and_accum_updates() > > > Key: SPARK-26019 > URL: https://issues.apache.org/jira/browse/SPARK-26019 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2, 2.4.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > Started happening after 2.3.1 -> 2.3.2 upgrade. > > {code:python} > Exception happened during processing of request from ('127.0.0.1', 43418) > > Traceback (most recent call last): > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 290, in _handle_request_noblock > self.process_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 318, in process_request > self.finish_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 331, in finish_request > self.RequestHandlerClass(request, client_address, self) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 652, in __init__ > self.handle() > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 263, in handle > poll(authenticate_and_accum_updates) > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 238, in poll > if func(): > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 251, in authenticate_and_accum_updates > received_token = self.rfile.read(len(auth_token)) > TypeError: object of type 'NoneType' has no len() > > {code} > > Error happens here: > https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254 > The PySpark code was just running a simple pipeline of > binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. ) > and then converting it to a dataframe and running a count on it. > It seems error is flaky - on next rerun it didn't happen. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692226#comment-16692226 ] Ruslan Dautkhanov commented on SPARK-26019: --- Might be broken by https://github.com/apache/spark/pull/22635 change > pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" > in authenticate_and_accum_updates() > > > Key: SPARK-26019 > URL: https://issues.apache.org/jira/browse/SPARK-26019 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2, 2.4.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > Started happening after 2.3.1 -> 2.3.2 upgrade. > > {code:python} > Exception happened during processing of request from ('127.0.0.1', 43418) > > Traceback (most recent call last): > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 290, in _handle_request_noblock > self.process_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 318, in process_request > self.finish_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 331, in finish_request > self.RequestHandlerClass(request, client_address, self) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 652, in __init__ > self.handle() > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 263, in handle > poll(authenticate_and_accum_updates) > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 238, in poll > if func(): > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 251, in authenticate_and_accum_updates > received_token = self.rfile.read(len(auth_token)) > TypeError: object of type 'NoneType' has no len() > > {code} > > Error happens here: > https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254 > The PySpark code was just running a simple pipeline of > binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. ) > and then converting it to a dataframe and running a count on it. > It seems error is flaky - on next rerun it didn't happen. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692219#comment-16692219 ] Ruslan Dautkhanov commented on SPARK-26019: --- [~hyukjin.kwon] today I reproduced this for the first time.. but we still receive reports from our other users as well. Here's the code, on Spark 2.3.2 + Python 2.7.15. Execute it on a freshly created Spark session:
{code:python}
def python_major_version():
    import sys
    return sys.version_info[0]

print(python_major_version())
print(sc.parallelize([1]).map(lambda x: python_major_version()).collect())  # error happens here!
{code}
It always reproduces for me. Notice that just rerunning the same code makes this error disappear. > pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" > in authenticate_and_accum_updates() > > > Key: SPARK-26019 > URL: https://issues.apache.org/jira/browse/SPARK-26019 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2, 2.4.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > Started happening after 2.3.1 -> 2.3.2 upgrade. > > {code:python} > Exception happened during processing of request from ('127.0.0.1', 43418) > > Traceback (most recent call last): > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 290, in _handle_request_noblock > self.process_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 318, in process_request > self.finish_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 331, in finish_request > self.RequestHandlerClass(request, client_address, self) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 652, in __init__ > self.handle() > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 263, in handle > poll(authenticate_and_accum_updates) > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 238, in poll > if func(): > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 251, in authenticate_and_accum_updates > received_token = self.rfile.read(len(auth_token)) > TypeError: object of type 'NoneType' has no len() > > {code} > > Error happens here: > https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254 > The PySpark code was just running a simple pipeline of > binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. ) > and then converting it to a dataframe and running a count on it. > It seems error is flaky - on next rerun it didn't happen. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Dautkhanov reopened SPARK-26019: --- Reproduced myself > pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" > in authenticate_and_accum_updates() > > > Key: SPARK-26019 > URL: https://issues.apache.org/jira/browse/SPARK-26019 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2, 2.4.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > Started happening after 2.3.1 -> 2.3.2 upgrade. > > {code:python} > Exception happened during processing of request from ('127.0.0.1', 43418) > > Traceback (most recent call last): > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 290, in _handle_request_noblock > self.process_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 318, in process_request > self.finish_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 331, in finish_request > self.RequestHandlerClass(request, client_address, self) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 652, in __init__ > self.handle() > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 263, in handle > poll(authenticate_and_accum_updates) > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 238, in poll > if func(): > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 251, in authenticate_and_accum_updates > received_token = self.rfile.read(len(auth_token)) > TypeError: object of type 'NoneType' has no len() > > {code} > > Error happens here: > https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254 > The PySpark code was just running a simple pipeline of > binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. ) > and then converting it to a dataframe and running a count on it. > It seems error is flaky - on next rerun it didn't happen. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26121) [Structured Streaming] Allow users to define prefix of Kafka's consumer group (group.id)
Anastasios Zouzias created SPARK-26121: -- Summary: [Structured Streaming] Allow users to define prefix of Kafka's consumer group (group.id) Key: SPARK-26121 URL: https://issues.apache.org/jira/browse/SPARK-26121 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Anastasios Zouzias I ran into the following situation with Spark Structured Streaming (SS) using Kafka. In a project that I work on, there is already a secured Kafka setup where ops can issue an SSL certificate per group.id, which should be predefined (or whose prefix should be predefined). On the other hand, Spark SS fixes the group.id to val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}" see, e.g., https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L124 https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L81 I guess the Spark developers had a good reason to fix it, but is it possible to make the prefix of the above uniqueGroupId ("spark-kafka-source") configurable? The rationale is that Spark users would then not be forced to use the same certificate on group ids of the form spark-kafka-source-*. DoD: * Allow Spark SS users to define the group.id prefix as an input parameter. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
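As a sketch of what the request amounts to: a per-query source option for the prefix, so that the generated consumer group id becomes <prefix>-<uuid>-<hash> instead of spark-kafka-source-<uuid>-<hash>. The option name groupIdPrefix below is the one a later Spark release adopted for this; at the time of this ticket it did not exist, so treat the whole snippet as an assumption.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-group-prefix").getOrCreate()

# Hypothetical usage: the consumer group id would start with the
# ops-approved prefix, matching the SSL certificate issued for it.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")  # illustrative host
      .option("subscribe", "events")                      # illustrative topic
      .option("groupIdPrefix", "my-team-prefix")
      .load())
{code}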
[jira] [Assigned] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests
[ https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26120: Assignee: Shixiong Zhu (was: Apache Spark) > Fix a streaming query leak in Structured Streaming R tests > -- > > Key: SPARK-26120 > URL: https://issues.apache.org/jira/browse/SPARK-26120 > Project: Spark > Issue Type: Test > Components: SparkR, Structured Streaming, Tests >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > > "Specify a schema by using a DDL-formatted string when reading" doesn't stop > the streaming query before stopping Spark. It causes the following annoying > logs. > {code} > Exception in thread "stream execution thread for [id = > 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = > 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution > thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = > 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: > Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) > at > org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) > Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. > at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) > at > org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) > at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) > ... 
7 more > org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) > at > org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) > Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. > at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) > at > org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) > at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) > ... 7 more > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests
Shixiong Zhu created SPARK-26120: Summary: Fix a streaming query leak in Structured Streaming R tests Key: SPARK-26120 URL: https://issues.apache.org/jira/browse/SPARK-26120 Project: Spark Issue Type: Test Components: Tests Affects Versions: 2.4.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu "Specify a schema by using a DDL-formatted string when reading" doesn't stop the streaming query before stopping Spark. It causes the following annoying logs. {code} Exception in thread "stream execution thread for [id = 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) at org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) at org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped. at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) at org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) ... 7 more org.apache.spark.SparkException: Exception thrown in awaitResult: at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) at org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) at org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped. 
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) at org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) ... 7 more {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
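The hygiene rule behind this fix generalizes beyond the R suite: stop a started streaming query before stopping the session, otherwise the query's execution thread races the shutting-down RpcEnv and emits the noise quoted above. Below is a PySpark analogy of the tidy shutdown order (the actual test being fixed is SparkR, and the source/sink choice here is arbitrary).

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clean-shutdown").getOrCreate()

# A trivial always-on source/sink pair, used only to have a running query.
df = spark.readStream.format("rate").load()
query = df.writeStream.format("console").start()

# Stop the query first, then the session. Reversing the order is what
# produces the "RpcEnv already stopped" stack traces quoted above.
query.stop()
spark.stop()
{code}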
[jira] [Commented] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests
[ https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692178#comment-16692178 ] Apache Spark commented on SPARK-26120: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/23089 > Fix a streaming query leak in Structured Streaming R tests > -- > > Key: SPARK-26120 > URL: https://issues.apache.org/jira/browse/SPARK-26120 > Project: Spark > Issue Type: Test > Components: SparkR, Structured Streaming, Tests >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > > "Specify a schema by using a DDL-formatted string when reading" doesn't stop > the streaming query before stopping Spark. It causes the following annoying > logs. > {code} > Exception in thread "stream execution thread for [id = > 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = > 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution > thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = > 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: > Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) > at > org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) > Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. > at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) > at > org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) > at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) > ... 
7 more > org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) > at > org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) > Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. > at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) > at > org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) > at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) > ... 7 more > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For
[jira] [Assigned] (SPARK-26119) Task metrics summary in the stage page should contain only successful tasks metrics
[ https://issues.apache.org/jira/browse/SPARK-26119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26119: Assignee: (was: Apache Spark) > Task metrics summary in the stage page should contain only successful tasks > metrics > --- > > Key: SPARK-26119 > URL: https://issues.apache.org/jira/browse/SPARK-26119 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > Currently task metrics summary table in the stage tab shows summary > corresponds to all the tasks. But, we should display the summary of only > succeeded tasks in the tasks summary metrics table -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests
[ https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692176#comment-16692176 ] Apache Spark commented on SPARK-26120: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/23089 > Fix a streaming query leak in Structured Streaming R tests > -- > > Key: SPARK-26120 > URL: https://issues.apache.org/jira/browse/SPARK-26120 > Project: Spark > Issue Type: Test > Components: SparkR, Structured Streaming, Tests >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > > "Specify a schema by using a DDL-formatted string when reading" doesn't stop > the streaming query before stopping Spark. It causes the following annoying > logs. > {code} > Exception in thread "stream execution thread for [id = > 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = > 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution > thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = > 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: > Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) > at > org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) > Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. > at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) > at > org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) > at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) > ... 
7 more > org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) > at > org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) > Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. > at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) > at > org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) > at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) > ... 7 more > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For
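A typical fix for this kind of leak is to stop any still-active streaming queries before stopping Spark, so no stream execution thread outlives the RpcEnv it needs for cleanup. A minimal Scala sketch of the pattern, not necessarily what PR 23089 does (the R test itself would call SparkR's stopQuery(); spark here is the active SparkSession):
{code:scala}
// Stop every active streaming query before tearing down the session, so no
// stream execution thread is left asking a stopped RpcEnv for termination.
spark.streams.active.foreach(_.stop())
spark.stop()
{code}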
[jira] [Assigned] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests
[ https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26120: Assignee: Apache Spark (was: Shixiong Zhu) > Fix a streaming query leak in Structured Streaming R tests > -- > > Key: SPARK-26120 > URL: https://issues.apache.org/jira/browse/SPARK-26120 > Project: Spark > Issue Type: Test > Components: SparkR, Structured Streaming, Tests >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark >Priority: Minor > > "Specify a schema by using a DDL-formatted string when reading" doesn't stop > the streaming query before stopping Spark. It causes the following annoying > logs. > {code} > Exception in thread "stream execution thread for [id = > 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = > 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution > thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = > 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: > Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) > at > org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) > Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. > at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) > at > org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) > at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) > ... 
7 more > org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) > at > org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) > Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. > at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) > at > org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) > at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) > ... 7 more > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26119) Task metrics summary in the stage page should contain only successful tasks metrics
[ https://issues.apache.org/jira/browse/SPARK-26119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26119: Assignee: Apache Spark > Task metrics summary in the stage page should contain only successful tasks > metrics > --- > > Key: SPARK-26119 > URL: https://issues.apache.org/jira/browse/SPARK-26119 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: ABHISHEK KUMAR GUPTA >Assignee: Apache Spark >Priority: Major > > Currently the task metrics summary table in the stage tab shows a summary > corresponding to all the tasks, but we should display the summary of only the > succeeded tasks in the task summary metrics table. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26119) Task metrics summary in the stage page should contain only successful tasks metrics
[ https://issues.apache.org/jira/browse/SPARK-26119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692175#comment-16692175 ] Apache Spark commented on SPARK-26119: -- User 'shahidki31' has created a pull request for this issue: https://github.com/apache/spark/pull/23088 > Task metrics summary in the stage page should contain only successful tasks > metrics > --- > > Key: SPARK-26119 > URL: https://issues.apache.org/jira/browse/SPARK-26119 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > Currently the task metrics summary table in the stage tab shows a summary > corresponding to all the tasks, but we should display the summary of only the > succeeded tasks in the task summary metrics table. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests
[ https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-26120: - Priority: Minor (was: Major) > Fix a streaming query leak in Structured Streaming R tests > -- > > Key: SPARK-26120 > URL: https://issues.apache.org/jira/browse/SPARK-26120 > Project: Spark > Issue Type: Test > Components: SparkR, Structured Streaming, Tests >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > > "Specify a schema by using a DDL-formatted string when reading" doesn't stop > the streaming query before stopping Spark. It causes the following annoying > logs. > {code} > Exception in thread "stream execution thread for [id = > 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = > 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution > thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = > 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: > Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) > at > org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) > Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. > at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) > at > org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) > at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) > ... 
7 more > org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) > at > org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) > Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. > at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) > at > org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) > at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) > ... 7 more > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests
[ https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-26120: - Component/s: Structured Streaming SparkR > Fix a streaming query leak in Structured Streaming R tests > -- > > Key: SPARK-26120 > URL: https://issues.apache.org/jira/browse/SPARK-26120 > Project: Spark > Issue Type: Test > Components: SparkR, Structured Streaming, Tests >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > > "Specify a schema by using a DDL-formatted string when reading" doesn't stop > the streaming query before stopping Spark. It causes the following annoying > logs. > {code} > Exception in thread "stream execution thread for [id = > 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = > 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution > thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = > 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: > Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) > at > org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) > Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. > at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) > at > org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) > at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) > ... 
7 more > org.apache.spark.SparkException: Exception thrown in awaitResult: > at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108) > at > org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342) > at > org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204) > Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already > stopped. > at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158) > at > org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135) > at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229) > at > org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523) > at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91) > ... 7 more > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26119) Task metrics summary in the stage page should contain only successful tasks metrics
[ https://issues.apache.org/jira/browse/SPARK-26119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692093#comment-16692093 ] shahid commented on SPARK-26119: Thanks. I am working on it. > Task metrics summary in the stage page should contain only successful tasks > metrics > --- > > Key: SPARK-26119 > URL: https://issues.apache.org/jira/browse/SPARK-26119 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.2 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > Currently the task metrics summary table in the stage tab shows a summary > corresponding to all the tasks, but we should display the summary of only the > succeeded tasks in the task summary metrics table. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26119) Task metrics summary in the stage page should contain only successful tasks metrics
ABHISHEK KUMAR GUPTA created SPARK-26119: Summary: Task metrics summary in the stage page should contain only successful tasks metrics Key: SPARK-26119 URL: https://issues.apache.org/jira/browse/SPARK-26119 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 2.3.2 Reporter: ABHISHEK KUMAR GUPTA Currently the task metrics summary table in the stage tab shows a summary corresponding to all the tasks, but we should display the summary of only the succeeded tasks in the task summary metrics table. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
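For illustration, the requested change amounts to filtering the per-task records on status before aggregating them. A minimal sketch with hypothetical types (TaskRecord and its fields are stand-ins, not Spark's actual UI store model):
{code:scala}
// Hypothetical task record; Spark's real stage-page data types differ.
case class TaskRecord(status: String, durationMs: Long)

// Summarize only succeeded tasks, as the stage page should.
def summarize(tasks: Seq[TaskRecord]): Option[(Long, Long, Double)] = {
  val ok = tasks.filter(_.status == "SUCCESS").map(_.durationMs)
  if (ok.isEmpty) None
  else Some((ok.min, ok.max, ok.sum.toDouble / ok.size))
}
{code}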
[jira] [Commented] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark
[ https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692018#comment-16692018 ] Attila Zsolt Piros commented on SPARK-26118: I am working on a PR. > Make Jetty's requestHeaderSize configurable in Spark > > > Key: SPARK-26118 > URL: https://issues.apache.org/jira/browse/SPARK-26118 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > For long authorization fields, the request header size can exceed the > default limit (8192 bytes), in which case Jetty replies with HTTP 413 (Request > Entity Too Large). > This issue may occur if the user is a member of many Active Directory user > groups. > The HTTP request to the server carries the Kerberos token in the > WWW-Authenticate header, and the header size grows with the number > of user groups. > Currently there is no way in Spark to override this limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark
[ https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-26118: --- Affects Version/s: (was: 2.3.2) (was: 2.4.0) (was: 2.2.2) (was: 2.1.3) > Make Jetty's requestHeaderSize configurable in Spark > > > Key: SPARK-26118 > URL: https://issues.apache.org/jira/browse/SPARK-26118 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > For long authorization fields, the request header size can exceed the > default limit (8192 bytes), in which case Jetty replies with HTTP 413 (Request > Entity Too Large). > This issue may occur if the user is a member of many Active Directory user > groups. > The HTTP request to the server carries the Kerberos token in the > WWW-Authenticate header, and the header size grows with the number > of user groups. > Currently there is no way in Spark to override this limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark
Attila Zsolt Piros created SPARK-26118: -- Summary: Make Jetty's requestHeaderSize configurable in Spark Key: SPARK-26118 URL: https://issues.apache.org/jira/browse/SPARK-26118 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.4.0, 2.3.2, 2.2.2, 2.1.3, 3.0.0 Reporter: Attila Zsolt Piros For long authorization fields, the request header size can exceed the default limit (8192 bytes), in which case Jetty replies with HTTP 413 (Request Entity Too Large). This issue may occur if the user is a member of many Active Directory user groups. The HTTP request to the server carries the Kerberos token in the WWW-Authenticate header, and the header size grows with the number of user groups. Currently there is no way in Spark to override this limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
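For context, Jetty itself exposes this limit through HttpConfiguration, so the improvement is essentially plumbing a Spark setting into the call below. A sketch of the raw Jetty 9 side only; the Spark configuration key that will front it is not decided in this ticket, and the 48 KB value and port 4040 are merely illustrative:
{code:scala}
import org.eclipse.jetty.server.{HttpConfiguration, HttpConnectionFactory, Server, ServerConnector}

val server = new Server()
val httpConfig = new HttpConfiguration()
// Raise the request header limit above Jetty's 8192-byte default so large
// Kerberos/Active Directory tokens in the auth headers still fit.
httpConfig.setRequestHeaderSize(48 * 1024)

val connector = new ServerConnector(server, new HttpConnectionFactory(httpConfig))
connector.setPort(4040)
server.addConnector(connector)
server.start()
{code}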
[jira] [Commented] (SPARK-26026) Published Scaladoc jars missing from Maven Central
[ https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691965#comment-16691965 ] Long Cao commented on SPARK-26026: -- [~srowen] thanks for the expedient resolution! It's a small thing to me but nice to see. > Published Scaladoc jars missing from Maven Central > -- > > Key: SPARK-26026 > URL: https://issues.apache.org/jira/browse/SPARK-26026 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Long Cao >Assignee: Sean Owen >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > For 2.3.x and beyond, it appears that published *-javadoc.jars are missing. > For concrete examples: > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > After some searching, I'm venturing a guess that [this > commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033] > removed packaging Scaladoc with the rest of the distribution. > I don't think it's a huge problem since the versioned Scaladocs are hosted on > apache.org, but I use an external documentation/search tool > ([Dash|https://kapeli.com/dash]) that operates by looking up published > javadoc jars and it'd be nice to have these available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23886) update query.status
[ https://issues.apache.org/jira/browse/SPARK-23886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691881#comment-16691881 ] Gabor Somogyi commented on SPARK-23886: --- Thanks [~efim.poberezkin]! May I ask you then to close your PRs for this Jira and for SPARK-24063? > update query.status > --- > > Key: SPARK-23886 > URL: https://issues.apache.org/jira/browse/SPARK-23886 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26090) Resolve most miscellaneous deprecation and build warnings for Spark 3
[ https://issues.apache.org/jira/browse/SPARK-26090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26090. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23065 [https://github.com/apache/spark/pull/23065] > Resolve most miscellaneous deprecation and build warnings for Spark 3 > - > > Key: SPARK-26090 > URL: https://issues.apache.org/jira/browse/SPARK-26090 > Project: Spark > Issue Type: Improvement > Components: ML, Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > The build has a lot of deprecation warnings. Some are new in Scala 2.12 and > Java 11. We've fixed some, but I wanted to take a pass at fixing lots of easy > miscellaneous ones here. > They're too numerous and small to list here; see the pull request. Some > highlights: > - @BeanInfo is deprecated in 2.12, and BeanInfo classes are pretty ancient in > Java. Instead, case classes can explicitly declare getters > - Eta expansion of zero-arg methods; foo() becomes () => foo() in many cases > - Floating-point Range is inexact and deprecated, like 0.0 to 100.0 by 1.0 > - finalize() is finally deprecated (just needs to be suppressed) > - StageInfo.attemptId was deprecated and easiest to remove here > I'm not going to touch some chunks of deprecation warnings for now: > - Parquet deprecations > - Hive deprecations (particularly serde2 classes) > - Deprecations in generated code (mostly Thriftserver CLI) > - ProcessingTime deprecations (we may need to revive this class as internal) > - many MLlib deprecations because they concern methods that may be removed > anyway > - a few Kinesis deprecations I couldn't figure out > - Mesos get/setRole, which I don't know well > - Kafka/ZK deprecations (e.g. poll()) > - Kinesis > - a few other ones that will probably resolve by deleting a deprecated method -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
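Two of the mechanical rewrites named above, written out for illustration (Scala 2.12; the names here are made up, and the actual patch touches many call sites):
{code:scala}
// Eta expansion of a zero-arg method is deprecated in 2.12: be explicit.
def heartbeat(): Unit = ()
val tick: () => Unit = () => heartbeat()   // instead of relying on implicit expansion

// Floating-point ranges are inexact and deprecated; use BigDecimal steps.
val steps = BigDecimal(0.0) to BigDecimal(100.0) by BigDecimal(1.0)   // instead of 0.0 to 100.0 by 1.0
{code}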
[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version
[ https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691828#comment-16691828 ] Sunil Rangwani commented on SPARK-14492: [~ottomata] you are using a CDH version I believe. I don't know what dependencies their parcels include. This bug was originally found in the open source version built using maven as described here [https://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support] > Spark SQL 1.6.0 does not work with external Hive metastore version lower than > 1.2.0; its not backwards compatible with earlier version > -- > > Key: SPARK-14492 > URL: https://issues.apache.org/jira/browse/SPARK-14492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunil Rangwani >Priority: Critical > > Spark SQL when configured with a Hive version lower than 1.2.0 throws a > java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME > because this field was introduced in Hive 1.2.0 so its not possible to use > Hive metastore version lower than 1.2.0 with Spark. The details of the Hive > changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 > {code:java} > Exception in thread "main" java.lang.NoSuchFieldError: > METASTORE_CLIENT_SOCKET_LIFETIME > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:271) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version
[ https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691811#comment-16691811 ] Sunil Rangwani commented on SPARK-14492: [~srowen] Please reopen this. It is not about building with varying versions of Hive. After building with the required version of Hive, it gives a runtime error - as explained in my earlier comment. > Spark SQL 1.6.0 does not work with external Hive metastore version lower than > 1.2.0; its not backwards compatible with earlier version > -- > > Key: SPARK-14492 > URL: https://issues.apache.org/jira/browse/SPARK-14492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunil Rangwani >Priority: Critical > > Spark SQL when configured with a Hive version lower than 1.2.0 throws a > java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME > because this field was introduced in Hive 1.2.0 so its not possible to use > Hive metastore version lower than 1.2.0 with Spark. The details of the Hive > changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 > {code:java} > Exception in thread "main" java.lang.NoSuchFieldError: > METASTORE_CLIENT_SOCKET_LIFETIME > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:271) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25528) data source V2 API refactoring (batch read)
[ https://issues.apache.org/jira/browse/SPARK-25528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-25528: Summary: data source V2 API refactoring (batch read) (was: data source V2 read side API refactoring) > data source V2 API refactoring (batch read) > --- > > Key: SPARK-25528 > URL: https://issues.apache.org/jira/browse/SPARK-25528 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > refactor the read side API according to this abstraction > {code} > batch: catalog -> table -> scan > streaming: catalog -> table -> stream -> scan > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
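Written out as hypothetical Scala traits, the abstraction reads roughly as below; these are not the final interface names, just the shape the refactoring aims for:
{code:scala}
// Hypothetical sketch of the target abstraction, not Spark's actual DSv2 API.
trait Catalog { def loadTable(name: String): Table }

trait Table {
  def newScanBuilder(): ScanBuilder   // batch: table -> scan
  def newStream(): Stream             // streaming: table -> stream
}

trait Stream { def newScanBuilder(): ScanBuilder }  // streaming: stream -> scan
trait ScanBuilder { def build(): Scan }
trait Scan
{code}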
[jira] [Commented] (SPARK-25528) data source V2 API refactoring (batch read)
[ https://issues.apache.org/jira/browse/SPARK-25528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691799#comment-16691799 ] Apache Spark commented on SPARK-25528: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/23086 > data source V2 API refactoring (batch read) > --- > > Key: SPARK-25528 > URL: https://issues.apache.org/jira/browse/SPARK-25528 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > refactor the read side API according to this abstraction > {code} > batch: catalog -> table -> scan > streaming: catalog -> table -> stream -> scan > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26071) disallow map as map key
[ https://issues.apache.org/jira/browse/SPARK-26071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26071. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23045 [https://github.com/apache/spark/pull/23045] > disallow map as map key > --- > > Key: SPARK-26071 > URL: https://issues.apache.org/jira/browse/SPARK-26071 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version
[ https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691787#comment-16691787 ] Andrew Otto commented on SPARK-14492: - I found my issue: we were loading some Hive 1.1.0 jars manually using spark.driver.extraClassPath in order to initiate a Hive JDBC directly to Hive, instead of using spark.sql() to work around https://issues.apache.org/jira/browse/SPARK-23890. The Hive 1.1.0 classes were loaded before the ones included with Spark, and as such they failed referencing a Hive configuration that didn't exist in 1.1.0. > Spark SQL 1.6.0 does not work with external Hive metastore version lower than > 1.2.0; its not backwards compatible with earlier version > -- > > Key: SPARK-14492 > URL: https://issues.apache.org/jira/browse/SPARK-14492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Sunil Rangwani >Priority: Critical > > Spark SQL when configured with a Hive version lower than 1.2.0 throws a > java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME > because this field was introduced in Hive 1.2.0 so its not possible to use > Hive metastore version lower than 1.2.0 with Spark. The details of the Hive > changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 > {code:java} > Exception in thread "main" java.lang.NoSuchFieldError: > METASTORE_CLIENT_SOCKET_LIFETIME > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237) > at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272) > at > org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at org.apache.spark.sql.SQLContext.(SQLContext.scala:271) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90) > at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
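For anyone hitting the same classpath trap on Spark 2.x, the supported route is to point Spark SQL at the older metastore client rather than prepending Hive jars via extraClassPath. The spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars options are real, though which Hive versions they accept depends on the Spark release; a sketch:
{code:scala}
import org.apache.spark.sql.SparkSession

// Let Spark load an isolated Hive 1.1.0 metastore client instead of
// leaking older Hive classes onto the driver classpath.
val spark = SparkSession.builder()
  .appName("old-metastore")
  .config("spark.sql.hive.metastore.version", "1.1.0")
  .config("spark.sql.hive.metastore.jars", "maven")   // or a classpath of matching jars
  .enableHiveSupport()
  .getOrCreate()
{code}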
[jira] [Resolved] (SPARK-26112) Update since versions of new built-in functions.
[ https://issues.apache.org/jira/browse/SPARK-26112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26112. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23082 [https://github.com/apache/spark/pull/23082] > Update since versions of new built-in functions. > > > Key: SPARK-26112 > URL: https://issues.apache.org/jira/browse/SPARK-26112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.0.0 > > > The following 5 functions were removed from branch-2.4: > - map_entries > - map_filter > - transform_values > - transform_keys > - map_zip_with > We should update the since version to 3.0.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26024) Dataset API: repartitionByRange(...) has inconsistent behaviour
[ https://issues.apache.org/jira/browse/SPARK-26024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26024: --- Assignee: Julien Peloton > Dataset API: repartitionByRange(...) has inconsistent behaviour > --- > > Key: SPARK-26024 > URL: https://issues.apache.org/jira/browse/SPARK-26024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2 > Environment: Spark version 2.3.2 >Reporter: Julien Peloton >Assignee: Julien Peloton >Priority: Major > Labels: dataFrame, partitioning, repartition, spark-sql > Fix For: 3.0.0 > > > Hi, > I recently played with the {{repartitionByRange}} method for DataFrame > introduced in SPARK-22614. For DataFrames larger than the one tested in the > code (which has only 10 elements), the code returns nondeterministic results. > As a test showing the inconsistent behaviour, I start from the unit test code > for {{repartitionByRange}} > ([here|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala#L352]) > but I increase the size of the initial array to 1000, repartition using 3 > partitions, and count the number of elements per partition: > {code} > // Shuffle numbers from 0 to 1000, and make a DataFrame > val df = Random.shuffle(0.to(1000)).toDF("val") > // Repartition it using 3 partitions > // Sum up number of elements in each partition, and collect it. > // And do it several times > for (i <- 0 to 9) { > val counts = df.repartitionByRange(3, col("val")) > .mapPartitions{part => Iterator(part.size)} > .collect() > println(counts.toList) > } > // -> the number of elements in each partition varies... > {code} > I do not know whether this is expected (I will dig further into the code), but it > sounds like a bug. > Or have I just misinterpreted what {{repartitionByRange}} is for? > Any ideas? > Thanks! > Julien -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26024) Dataset API: repartitionByRange(...) has inconsistent behaviour
[ https://issues.apache.org/jira/browse/SPARK-26024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26024. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23025 [https://github.com/apache/spark/pull/23025] > Dataset API: repartitionByRange(...) has inconsistent behaviour > --- > > Key: SPARK-26024 > URL: https://issues.apache.org/jira/browse/SPARK-26024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2 > Environment: Spark version 2.3.2 >Reporter: Julien Peloton >Priority: Major > Labels: dataFrame, partitioning, repartition, spark-sql > Fix For: 3.0.0 > > > Hi, > I recently played with the {{repartitionByRange}} method for DataFrame > introduced in SPARK-22614. For DataFrames larger than the one tested in the > code (which has only 10 elements), the code returns nondeterministic results. > As a test showing the inconsistent behaviour, I start from the unit test code > for {{repartitionByRange}} > ([here|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala#L352]) > but I increase the size of the initial array to 1000, repartition using 3 > partitions, and count the number of elements per partition: > {code} > // Shuffle numbers from 0 to 1000, and make a DataFrame > val df = Random.shuffle(0.to(1000)).toDF("val") > // Repartition it using 3 partitions > // Sum up number of elements in each partition, and collect it. > // And do it several times > for (i <- 0 to 9) { > val counts = df.repartitionByRange(3, col("val")) > .mapPartitions{part => Iterator(part.size)} > .collect() > println(counts.toList) > } > // -> the number of elements in each partition varies... > {code} > I do not know whether this is expected (I will dig further into the code), but it > sounds like a bug. > Or have I just misinterpreted what {{repartitionByRange}} is for? > Any ideas? > Thanks! > Julien -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26112) Update since versions of new built-in functions.
[ https://issues.apache.org/jira/browse/SPARK-26112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26112: --- Assignee: Takuya Ueshin > Update since versions of new built-in functions. > > > Key: SPARK-26112 > URL: https://issues.apache.org/jira/browse/SPARK-26112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.0.0 > > > The following 5 functions were removed from branch-2.4: > - map_entries > - map_filter > - transform_values > - transform_keys > - map_zip_with > We should update the since version to 3.0.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26068) ChunkedByteBufferInputStream is truncated by empty chunk
[ https://issues.apache.org/jira/browse/SPARK-26068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26068: --- Assignee: Liu, Linhong > ChunkedByteBufferInputStream is truncated by empty chunk > > > Key: SPARK-26068 > URL: https://issues.apache.org/jira/browse/SPARK-26068 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Liu, Linhong >Assignee: Liu, Linhong >Priority: Major > Fix For: 3.0.0 > > > If a ChunkedByteBuffer contains an empty chunk in the middle, then the > ChunkedByteBufferInputStream is truncated: all data behind the empty > chunk is never read. > The problematic code: > {code:java} > // ChunkedByteBuffer.scala > // If chunks.next() returns an empty chunk, we reach the else branch > // regardless of whether chunks.hasNext is true, so some data is lost. > override def read(dest: Array[Byte], offset: Int, length: Int): Int = { > if (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext) > { > currentChunk = chunks.next() > } > if (currentChunk != null && currentChunk.hasRemaining) { > val amountToGet = math.min(currentChunk.remaining(), length) > currentChunk.get(dest, offset, amountToGet) > amountToGet > } else { > close() > -1 > } > } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26068) ChunkedByteBufferInputStream is truncated by empty chunk
[ https://issues.apache.org/jira/browse/SPARK-26068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26068. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23040 [https://github.com/apache/spark/pull/23040] > ChunkedByteBufferInputStream is truncated by empty chunk > > > Key: SPARK-26068 > URL: https://issues.apache.org/jira/browse/SPARK-26068 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Liu, Linhong >Priority: Major > Fix For: 3.0.0 > > > If a ChunkedByteBuffer contains an empty chunk in the middle, then the > ChunkedByteBufferInputStream is truncated: all data behind the empty > chunk is never read. > The problematic code: > {code:java} > // ChunkedByteBuffer.scala > // If chunks.next() returns an empty chunk, we reach the else branch > // regardless of whether chunks.hasNext is true, so some data is lost. > override def read(dest: Array[Byte], offset: Int, length: Int): Int = { > if (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext) > { > currentChunk = chunks.next() > } > if (currentChunk != null && currentChunk.hasRemaining) { > val amountToGet = math.min(currentChunk.remaining(), length) > currentChunk.get(dest, offset, amountToGet) > amountToGet > } else { > close() > -1 > } > } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
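The comment in the snippet points at the shape of the fix: advance past empty chunks in a loop rather than a single step. A sketch of the corrected read, assuming the pull request takes this form:
{code:scala}
// In ChunkedByteBufferInputStream (ChunkedByteBuffer.scala).
// Using `while` instead of `if` means an empty chunk in the middle no
// longer drops us into the else branch while later chunks still hold data.
override def read(dest: Array[Byte], offset: Int, length: Int): Int = {
  while (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext) {
    currentChunk = chunks.next()
  }
  if (currentChunk != null && currentChunk.hasRemaining) {
    val amountToGet = math.min(currentChunk.remaining(), length)
    currentChunk.get(dest, offset, amountToGet)
    amountToGet
  } else {
    close()
    -1
  }
}
{code}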
[jira] [Commented] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catching exceptions
[ https://issues.apache.org/jira/browse/SPARK-26117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691725#comment-16691725 ] Apache Spark commented on SPARK-26117: -- User 'heary-cao' has created a pull request for this issue: https://github.com/apache/spark/pull/23084 > use SparkOutOfMemoryError instead of OutOfMemoryError when catching exceptions > -- > > Key: SPARK-26117 > URL: https://issues.apache.org/jira/browse/SPARK-26117 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.5.0 >Reporter: caoxuewen >Priority: Major > > PR #20014 introduced SparkOutOfMemoryError to avoid killing the entire > executor when an OutOfMemoryError is thrown. > So when allocating memory via MemoryConsumer.allocatePage and catching the > resulting exception, we should use SparkOutOfMemoryError instead of > OutOfMemoryError. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26043) Make SparkHadoopUtil private to Spark
[ https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-26043: - Assignee: Sean Owen > Make SparkHadoopUtil private to Spark > - > > Key: SPARK-26043 > URL: https://issues.apache.org/jira/browse/SPARK-26043 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Assignee: Sean Owen >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > This API contains a few small helper methods used internally by Spark, mostly > related to Hadoop configs and Kerberos. > It has historically been marked as "DeveloperApi", but in reality it's not very > useful for others, and it changes too often to be considered a stable API. Better to > just make it private to Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26043) Make SparkHadoopUtil private to Spark
[ https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26043. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23066 [https://github.com/apache/spark/pull/23066] > Make SparkHadoopUtil private to Spark > - > > Key: SPARK-26043 > URL: https://issues.apache.org/jira/browse/SPARK-26043 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Assignee: Sean Owen >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > > This API contains a few small helper methods used internally by Spark, mostly > related to Hadoop configs and Kerberos. > It has historically been marked as "DeveloperApi", but in reality it's not very > useful for others, and it changes too often to be considered a stable API. Better to > just make it private to Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26026) Published Scaladoc jars missing from Maven Central
[ https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26026. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23069 [https://github.com/apache/spark/pull/23069] > Published Scaladoc jars missing from Maven Central > -- > > Key: SPARK-26026 > URL: https://issues.apache.org/jira/browse/SPARK-26026 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Long Cao >Assignee: Sean Owen >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > > For 2.3.x and beyond, it appears that published *-javadoc.jars are missing. > For concrete examples: > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > After some searching, I'm venturing a guess that [this > commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033] > removed packaging Scaladoc with the rest of the distribution. > I don't think it's a huge problem since the versioned Scaladocs are hosted on > apache.org, but I use an external documentation/search tool > ([Dash|https://kapeli.com/dash]) that operates by looking up published > javadoc jars and it'd be nice to have these available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception
[ https://issues.apache.org/jira/browse/SPARK-26117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26117: Assignee: (was: Apache Spark) > use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception > > Key: SPARK-26117 > URL: https://issues.apache.org/jira/browse/SPARK-26117 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL > Affects Versions: 2.5.0 > Reporter: caoxuewen > Priority: Major > > PR #20014 introduced SparkOutOfMemoryError to avoid killing the entire executor when an OutOfMemoryError is thrown. So when requesting memory via MemoryConsumer.allocatePage and catching the resulting exception, SparkOutOfMemoryError should be used instead of OutOfMemoryError.
[jira] [Assigned] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception
[ https://issues.apache.org/jira/browse/SPARK-26117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26117: Assignee: Apache Spark > use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception > > Key: SPARK-26117 > URL: https://issues.apache.org/jira/browse/SPARK-26117 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL > Affects Versions: 2.5.0 > Reporter: caoxuewen > Assignee: Apache Spark > Priority: Major > > PR #20014 introduced SparkOutOfMemoryError to avoid killing the entire executor when an OutOfMemoryError is thrown. So when requesting memory via MemoryConsumer.allocatePage and catching the resulting exception, SparkOutOfMemoryError should be used instead of OutOfMemoryError.
[jira] [Created] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception
caoxuewen created SPARK-26117: Summary: use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception Key: SPARK-26117 URL: https://issues.apache.org/jira/browse/SPARK-26117 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 2.5.0 Reporter: caoxuewen PR #20014 introduced SparkOutOfMemoryError to avoid killing the entire executor when an OutOfMemoryError is thrown. So when requesting memory via MemoryConsumer.allocatePage and catching the resulting exception, SparkOutOfMemoryError should be used instead of OutOfMemoryError.
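A minimal sketch of the pattern the ticket proposes, assuming the SparkOutOfMemoryError(String) constructor that exists in Spark 2.4; the helper name is illustrative and not from the PR:
{code:scala}
import org.apache.spark.memory.SparkOutOfMemoryError

// Hypothetical wrapper: run an allocation and rethrow any plain
// OutOfMemoryError as SparkOutOfMemoryError so that, per PR #20014,
// only the current task fails instead of the whole executor JVM.
def rethrowAsSparkOOM[T](allocate: => T): T =
  try allocate
  catch {
    case oom: OutOfMemoryError if !oom.isInstanceOf[SparkOutOfMemoryError] =>
      throw new SparkOutOfMemoryError(oom.getMessage)
  }
{code}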
[jira] [Updated] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Zhemzhitsky updated SPARK-26114: Description: Trying to use _coalesce_ after shuffle-oriented transformations leads to OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X GB of Y GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead_. Discussion is [here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html]. The error happens when trying to specify a pretty small number of partitions in the _coalesce_ call.
*How to reproduce?*
# Start spark-shell
{code:bash}
spark-shell \
  --num-executors=5 \
  --executor-cores=2 \
  --master=yarn \
  --deploy-mode=client \
  --conf spark.executor.memoryOverhead=512 \
  --conf spark.executor.memory=1g \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
{code}
Please note the _-Dio.netty.noUnsafe=true_ property. Preventing off-heap memory usage currently seems to be the only way to control the amount of memory used for transferring shuffle data. Also note that the total number of cores allocated to the job is 5x2=10.
# Then generate some test data
{code:scala}
import org.apache.hadoop.io._
import org.apache.hadoop.io.compress._
import org.apache.commons.lang._
import org.apache.spark._

// generate 100M records of sample data
sc.makeRDD(1 to 1000, 1000)
  .flatMap(item => (1 to 10)
    .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> new Text(RandomStringUtils.randomAlphanumeric(1024))))
  .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec]))
{code}
# Run the sample job
{code:scala}
import org.apache.hadoop.io._
import org.apache.spark._
import org.apache.spark.storage._

val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
rdd
  .map(item => item._1.toString -> item._2.toString)
  .repartitionAndSortWithinPartitions(new HashPartitioner(1000))
  .coalesce(10, false)
  .count
{code}
Note that the number of partitions is equal to the total number of cores allocated to the job. Here is the dominator tree from the heap dump !run1-noparams-dominator-tree.png|width=700! There are 4 instances of ExternalSorter, although there are only 2 concurrently running tasks per executor. !run1-noparams-dominator-tree-externalsorter.png|width=700! And the paths to GC roots of the already stopped ExternalSorter. !run1-noparams-dominator-tree-externalsorter-gc-root.png|width=700!
[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet leads to OOM errors
[ https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pierre Lienhart updated SPARK-26116: Description: When writing partitioned parquet using {{partitionBy}}, it looks like Spark sorts each partition before writing, but this sort consumes a huge amount of memory compared to the size of the data. The executors can then go OOM and get killed by YARN. As a consequence, it also forces provisioning huge amounts of memory compared to the size of the data to be written. Error messages found in the Spark UI are like the following:
{code:java}
Spark UI description of failure : Job aborted due to stage failure: Task 169 in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
{code}
{code:java}
Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, executor 1): org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad (No such file or directory)
    at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)
    at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)
    at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)
    at org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
    ... 8 more
{code}
In the stderr logs, we can see that a huge amount of sort data (the partition being sorted here is 250 MB when persisted into memory, deserialized) is being spilled to disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data of 3.6 GB to disk}}). Sometimes the data is spilled to disk in time and the sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out partition files one at a time.}}), but sometimes it does not and we see multiple {{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} messages until the application finally runs OOM with logs such as {{ERROR UnsafeExternalSorter: Unable to grow the pointer array}}. I should mention that when looking at individual (successful) write tasks in the Spark UI, the Peak Execution Memory metric is always 0. It looks like a known issue: SPARK-12546 is explicitly related.
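For reference, the write pattern described above is an ordinary partitioned parquet write. A minimal sketch that exercises this code path; the column names, sizes, and output path are illustrative and not taken from the ticket:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partitioned-write").getOrCreate()

// Hypothetical DataFrame with a low-cardinality partition column.
val df = spark.range(0L, 1000000L).selectExpr("id", "id % 100 as bucket")

// The ticket reports that Spark sorts the rows of each output partition
// before writing the files; that per-partition sort is where the memory
// pressure and the spilling described above show up.
df.write
  .partitionBy("bucket")
  .parquet("/tmp/partitioned-output")
{code}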
[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet leads to OOM errors
[ https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pierre Lienhart updated SPARK-26116: Description: When writing partitioned parquet using {{partitionBy}}, it looks like Spark sorts each partition before writing but this sort consumes a huge amount of memory compared to the size of the data. The executors can then go OOM and get killed by YARN. As a consequence, it also forces to provision huge amount of memory compared to the data to be written. Error messages found in the Spark UI are like the following : {code:java} Spark UI description of failure : Job aborted due to stage failure: Task 169 in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. {code} {code:java} Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, executor 1): org.apache.spark.SparkException: Task failed while writing rows at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad (No such file or directory) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425) at org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160) at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) ... 8 more{code} In the stderr logs, we can see that huge amount of sort data (the partition being sorted here is 250 MB when persisted into memory, deserialized) is being spilled to the disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data of 3.6 GB to disk}}). Sometimes the data is spilled in time to the disk and the sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out partition files one at a time.}}) but sometimes it does not and we see multiple {{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} until the application finally runs OOM with logs such as {{ERROR UnsafeExternalSorter: Unable to grow the pointer array}}. I should mention that when looking at individual (successful) write tasks in the Spark UI, the Peak Execution Memory metric is always 0. It looks like a known issue : SPARK-12546 is explicitly related and
[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors
[ https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pierre Lienhart updated SPARK-26116: Description: When writing partitioned parquet using {{partitionBy}}, it looks like Spark sorts each partition before writing but this sort consumes a huge amount of memory compared to the size of the data. The executors can then go OOM and get killed by YARN. As a consequence, it also force to provision huge amount of memory compared to the data to be written. Error messages found in the Spark UI are like the following : {code:java} Spark UI description of failure : Job aborted due to stage failure: Task 169 in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. {code} {code:java} Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, executor 1): org.apache.spark.SparkException: Task failed while writing rows at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad (No such file or directory) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425) at org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160) at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) ... 8 more{code} In the stderr logs, we can see that huge amount of sort data (the partition being sorted here is 250 MB when persisted into memory, deserialized) is being spilled to the disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data of 3.6 GB to disk}}). Sometimes the data is spilled in time to the disk and the sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out partition files one at a time.}}) but sometimes it does not and we see multiple {{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} until the application finally runs OOM with logs such as {{ERROR UnsafeExternalSorter: Unable to grow the pointer array}}. I should mention that when looking at individual (successful) write tasks in the Spark UI, the Peak Execution Memory metric is always 0. It looks like a known issue : Spark-12546 is explicitly related
[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet leads to OOM errors
[ https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pierre Lienhart updated SPARK-26116: Summary: Spark SQL - Sort when writing partitioned parquet leads to OOM errors (was: Spark SQL - Sort when writing partitioned parquet lead to OOM errors) > Spark SQL - Sort when writing partitioned parquet leads to OOM errors > - > > Key: SPARK-26116 > URL: https://issues.apache.org/jira/browse/SPARK-26116 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: Pierre Lienhart >Priority: Major > > When writing partitioned parquet using {{partitionBy}}, it looks like Spark > sorts each partition before writing but this sort consumes a huge amount of > memory compared to the size of the data. The executors can then go OOM and > get killed by YARN. As a consequence, it also force to provision huge amount > of memory compared to the data to be written. > Error messages found in the Spark UI are like the following : > {code:java} > Spark UI description of failure : Job aborted due to stage failure: Task 169 > in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage > 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure > (executor 1 exited caused by one of the running tasks) Reason: Container > killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory > used. Consider boosting spark.yarn.executor.memoryOverhead. > {code} > > {code:java} > Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most > recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, > executor 1): org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.OutOfMemoryError: error while calling spill() on > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : > /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad > (No such file or directory) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425) > at > org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160) > at > 
org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) > ... 8 more{code} > > In the stderr logs, we can see that huge amount of sort data (the partition > being sorted here is 250 MB when persisted into memory, deserialized) is > being spilled to the disk ({{INFO
[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors
[ https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pierre Lienhart updated SPARK-26116: Description: When writing partitioned parquet using {{partitionBy}}, it looks like Spark sorts each partition before writing but this sort consumes a huge amount of memory compared to the size of the data. The executors can then go OOM and get killed by YARN. As a consequence, it also force to provision huge amount of memory compared to the data to be written. Error messages found in the Spark UI are like the following : {code:java} Spark UI description of failure : Job aborted due to stage failure: Task 169 in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. {code} {code:java} Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, executor 1): org.apache.spark.SparkException: Task failed while writing rows at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad (No such file or directory) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425) at org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160) at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) ... 8 more{code} In the stderr logs, we can see that huge amount of sort data (the partition being sorted here is 250 MB when persisted into memory, deserialized) is being spilled to the disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data of 3.6 GB to disk}}). Sometimes the data is spilled in time to the disk and the sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out partition files one at a time.}}) but sometimes it does not and we see multiple {{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} until the application finally runs OOM with logs such as {{ERROR UnsafeExternalSorter: Unable to grow the pointer array}}. I should mention that when looking at individual (successful) write tasks in the Spark UI, the Peak Execution Memory metric is always 0. It looks like a known issue : SPARK-12546 is explicitly related and
[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors
[ https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pierre Lienhart updated SPARK-26116: Description: When writing partitioned parquet using {{partitionBy}}, it looks like Spark sorts each partition before writing but this sort consumes a huge amount of memory compared to the size of the data. The executors can then go OOM and get killed by YARN. As a consequence, it also force to provision huge amount of memory compared to the data to be written. Error messages found in the Spark UI are like the following : {code:java} Spark UI description of failure : Job aborted due to stage failure: Task 169 in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. {code} {code:java} Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, executor 1): org.apache.spark.SparkException: Task failed while writing rows at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad (No such file or directory) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425) at org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160) at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) ... 8 more{code} In the stderr logs, we can see that huge amount of sort data (the partition being sorted here is 250 MB when persisted into memory, deserialized) is being spilled to the disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data of 3.6 GB to disk}}). Sometimes the data is spilled in time to the disk and the sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out partition files one at a time.}}) but sometimes it does not and we see multiple {{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} until the application finally runs OOM with logs such as {{ERROR UnsafeExternalSorter: Unable to grow the pointer array}}. I should mention that when looking at individual (successful) write tasks in the Spark UI, the Peak Execution Memory metric is always 0. It looks like a known issue : SPARK-12546 is explicitly related
[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors
[ https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pierre Lienhart updated SPARK-26116: Description: When writing partitioned parquet using {{partitionBy}}, it looks like Spark sorts each partition before writing but this sort consumes a huge amount of memory compared to the size of the data. The executors can then go OOM and get killed by YARN. As a consequence, it also force to provision huge amount of memory compared to the data to be written. Error messages found in the Spark UI are like the following : {code:java} Spark UI description of failure : Job aborted due to stage failure: Task 169 in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. {code} {code:java} Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, executor 1): org.apache.spark.SparkException: Task failed while writing rows at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad (No such file or directory) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425) at org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160) at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) ... 8 more{code} In the stderr logs, we can see that huge amount of sort data (the partition being sorted here is 250 MB when persisted into memory, deserialized) is being spilled to the disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data of 3.6 GB to disk}}). Sometimes the data is spilled in time to the disk and the sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out partition files one at a time.}}) but sometimes it does not and we see multiple {{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} until the application finally runs OOM with logs such as {{ERROR UnsafeExternalSorter: Unable to grow the pointer array}}. I should mention that when looking at individual (successful) write tasks in the Spark UI, the Peak Execution Memory metric is always 0. It looks like a known issue : SPARK-12546 is explicitly related
[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors
[ https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pierre Lienhart updated SPARK-26116: Description: When writing partitioned parquet using {{partitionBy}}, it looks like Spark sorts each partition before writing but this sort consumes a huge amount of memory compared to the size of the data. The executors can then go OOM and get killed by YARN. As a consequence, it also force to provision huge amount of memory compared to the data to be written. Error messages found in the Spark UI are like the following : {code:java} Spark UI description of failure : Job aborted due to stage failure: Task 169 in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. {code} {code:java} Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, executor 1): org.apache.spark.SparkException: Task failed while writing rows at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad (No such file or directory) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425) at org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160) at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) ... 8 more{code} In the stderr logs, we can see that huge amount of sort data (the partition being sorted here is 250 MB when persisted into memory, deserialized) is being spilled to the disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data of 3.6 GB to disk}}). Sometimes the data is spilled in time to the disk and the sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out partition files one at a time.}}) but sometimes it does not and we see multiple {{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} until the application finally runs OOM with logs such as {{ERROR UnsafeExternalSorter: Unable to grow the pointer array}}. I should mention that when looking at individual (successful) write tasks in the Spark UI, the Peak Execution Memory metric is always 0. It looks like a known issue : SPARK-12546 is explicitly related
[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors
[ https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pierre Lienhart updated SPARK-26116: Description: When writing partitioned parquet using {{partitionBy}}, it looks like Spark sorts each partition before writing but this sort consumes a huge amount of memory compared to the size of the data. The executors can then go OOM and get killed by YARN. As a consequence, it also force to provision huge amount of memory compared to the data to be written. Error messages found in the Spark UI are like the following : {code:java} Spark UI description of failure : Job aborted due to stage failure: Task 169 in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. {code} {code:java} Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, executor 1): org.apache.spark.SparkException: Task failed while writing rows at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad (No such file or directory) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425) at org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160) at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) ... 8 more{code} In the stderr logs, we can see that huge amount of sort data (the partition being sorted here is 250 MB when persisted into memory, deserialized) is being spilled to the disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data of 3.6 GB to disk}}). Sometimes the data is spilled in time to the disk and the sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out partition files one at a time.}}) but sometimes it does not and we see multiple {{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} until the application finally runs OOM with logs such as {{ERROR UnsafeExternalSorter: Unable to grow the pointer array}}. I should mention that when looking at individual (successful) write tasks in the Spark UI, the Peak Execution Memory metric is always 0. It looks like a known issue : SPARK-12546 is explicitly related
[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors
[ https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pierre Lienhart updated SPARK-26116: Description: When writing partitioned parquet using {{partitionBy}}, it looks like Spark sorts each partition before writing but this sort consumes a huge amount of memory compared to the size of the data. The executors can then go OOM and get killed by YARN. As a consequence, it also force to provision huge amount of memory compared to the data to be written. Error messages found in the Spark UI are like the following : {code:java} Spark UI description of failure : Job aborted due to stage failure: Task 169 in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. {code} {code:java} Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, executor 1): org.apache.spark.SparkException: Task failed while writing rows at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad (No such file or directory) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254) at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425) at org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160) at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) ... 8 more{code} In the stderr logs, we can see that huge amount of sort data (the partition being sorted here is 250 MB when persisted into memory, deserialized) is being spilled to the disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data of 3.6 GB to disk}}). Sometimes the data is spilled in time to the disk and the sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out partition files one at a time.}}) but sometimes it does not and we see multiple {{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} until the application finally runs OOM with logs such as {{ERROR UnsafeExternalSorter: Unable to grow the pointer array}}. I should mention that when looking at individual (successful) write tasks in the Spark UI, the Peak Execution Memory metric is always 0. It looks like a known issue :
[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors
[ https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pierre Lienhart updated SPARK-26116:
Description:
When writing partitioned parquet using {{partitionBy}}, it looks like Spark sorts each partition before writing it, and this sort consumes a huge amount of memory compared to the size of the data. The executors can then go OOM and get killed by YARN. As a consequence, it also forces us to provision a huge amount of memory compared to the amount of data to be written. Error messages found in the Spark UI look like the following:
{code:java}
Spark UI description of failure:
Job aborted due to stage failure: Task 169 in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
{code}
{code:java}
Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, executor 1): org.apache.spark.SparkException: Task failed while writing rows
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad (No such file or directory)
  at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)
  at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)
  at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)
  at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)
  at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
  ... 8 more
{code}
In the stderr logs, we can see that a huge amount of sort data (the partition being sorted here is 250 MB when persisted in memory, deserialized) is spilled to disk ({code:java}INFO UnsafeExternalSorter: Thread 155 spilling sort data of 3.6 GB to disk{code}). Sometimes the data is spilled to disk in time and the sort completes ({code:java}INFO FileFormatWriter: Sorting complete. Writing out partition files one at a time.{code}), but sometimes it does not, and we see multiple {code:java}TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.{code} messages until the application finally goes OOM with logs such as {code:java}ERROR UnsafeExternalSorter: Unable to grow the pointer array{code}. I should mention that when looking at individual (successful) write tasks in the Spark UI, the Peak Execution Memory metric is always 0.
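For context, here is a minimal sketch of the write pattern that triggers this code path; the input path and partition column are hypothetical, not taken from the report.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partitioned-write-sketch").getOrCreate()
val df = spark.read.parquet("/tmp/input")  // hypothetical input

// For dynamic-partition writes, each task's rows are sorted by the partition
// columns so output files can be written one partition at a time; this is the
// sort that spills (and can go OOM) in the report above.
df.write
  .partitionBy("year")  // hypothetical partition column
  .parquet("/tmp/output")

// A mitigation sometimes suggested (not validated in this ticket): cluster
// rows by the partition columns first, so each write task sees few distinct
// partition values and the writer-side sort has less to interleave.
df.repartition(df("year"))
  .write
  .partitionBy("year")
  .parquet("/tmp/output_clustered")
{code}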
[jira] [Created] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet leads to OOM errors
Pierre Lienhart created SPARK-26116:
---
Summary: Spark SQL - Sort when writing partitioned parquet leads to OOM errors
Key: SPARK-26116
URL: https://issues.apache.org/jira/browse/SPARK-26116
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.1.1
Reporter: Pierre Lienhart

When writing partitioned parquet using {{partitionBy}}, it looks like Spark sorts each partition before writing it, and this sort consumes a huge amount of memory compared to the size of the data. The executors can then go OOM and get killed by YARN. As a consequence, it also forces us to provision a huge amount of memory compared to the amount of data to be written. Error messages found in the Spark UI look like the following:
{code:java}
Spark UI description of failure:
Job aborted due to stage failure: Task 169 in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
{code}
{code:java}
Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, executor 1): org.apache.spark.SparkException: Task failed while writing rows
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad (No such file or directory)
  at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)
  at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)
  at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)
  at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)
  at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)
  at org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
  ... 8 more
{code}
In the stderr logs, we can see that a huge amount of sort data (the partition being sorted here is 250 MB when persisted in memory, deserialized) is spilled to disk ({code:java}INFO UnsafeExternalSorter: Thread 155 spilling sort data of 3.6 GB to disk{code}). Sometimes the data is spilled to disk in time and the sort completes ({code:java}INFO FileFormatWriter: Sorting complete. Writing out partition files one at a time.{code}), but sometimes it does not, and we see multiple {code:java}TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.{code} messages until the application finally goes OOM with logs such as {code:java}ERROR UnsafeExternalSorter: Unable to grow the pointer array{code}. I should mention that when looking at individual (successful) write tasks in the Spark UI, the Peak Execution Memory metric is always 0.
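The YARN kill message quoted above points at the executor memory overhead allowance. Below is a hedged sketch of that configuration; the values are illustrative and not taken from this report.
{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .set("spark.executor.memory", "8g")  // illustrative value
  // Extra off-heap allowance per executor, in MiB, on YARN in Spark 2.1.x;
  // Spark 2.3 and later renamed this to spark.executor.memoryOverhead.
  .set("spark.yarn.executor.memoryOverhead", "2048")

val spark = SparkSession.builder().config(conf).getOrCreate()
{code}
Raising the overhead only buys headroom; it does not change how much data the per-partition sort tries to hold.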
[jira] [Resolved] (SPARK-26115) Fix deprecated warnings related to scala 2.12 in core module
[ https://issues.apache.org/jira/browse/SPARK-26115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-26115.
Resolution: Duplicate

Duplicated to SPARK-2609

> Fix deprecated warnings related to scala 2.12 in core module
>
> Key: SPARK-26115
> URL: https://issues.apache.org/jira/browse/SPARK-26115
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Gengliang Wang
> Priority: Minor
>
> {quote}
> [warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:505: Eta-expansion of zero-argument methods is deprecated. To avoid this warning, write (() => SparkContext.this.reportHeartBeat()).
> [warn] _heartbeater = new Heartbeater(env.memoryManager, reportHeartBeat, "driver-heartbeater",
> [warn] ^
> [warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala:273: Eta-expansion of zero-argument methods is deprecated. To avoid this warning, write (() => HadoopDelegationTokenManager.this.fileSystemsToAccess()).
> [warn] val providers = Seq(new HadoopFSDelegationTokenProvider(fileSystemsToAccess)) ++
> [warn] ^
> [warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/executor/Executor.scala:193: Eta-expansion of zero-argument methods is deprecated. To avoid this warning, write (() => Executor.this.reportHeartBeat()).
> [warn] private val heartbeater = new Heartbeater(env.memoryManager, reportHeartBeat,
> [warn] ^
> [warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/api/r/RBackend.scala:102: method childGroup in class ServerBootstrap is deprecated: see corresponding Javadoc for more information.
> [warn] if (bootstrap != null && bootstrap.childGroup() != null) {
> [warn] ^
> [warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala:63: type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use java.util.concurrent.ForkJoinPool directly, instead of this alias
> [warn] new ForkJoinTaskSupport(new ForkJoinPool(8))
> [warn] ^
> [warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/scheduler/StageInfo.scala:59: value attemptId in class StageInfo is deprecated (since 2.3.0): Use attemptNumber instead
> [warn] def attemptNumber(): Int = attemptId
> [warn] ^
> [warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:186: type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use java.util.concurrent.ForkJoinPool directly, instead of this alias
> [warn] def newForkJoinPool(prefix: String, maxThreadNumber: Int): SForkJoinPool = {
> [warn] ^
> [warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:189: type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use java.util.concurrent.ForkJoinPool directly, instead of this alias
> [warn] override def newThread(pool: SForkJoinPool) =
> [warn] ^
> [warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:190: type ForkJoinWorkerThread in package forkjoin is deprecated (since 2.12.0): use java.util.concurrent.ForkJoinWorkerThread directly, instead of this alias
> [warn] new SForkJoinWorkerThread(pool) {
> [warn] ^
> [warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:194: type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use java.util.concurrent.ForkJoinPool directly, instead of this alias
> [warn] new SForkJoinPool(maxThreadNumber, factory,
> [warn] ^
> [warn] 10 warnings found
> {quote}
[jira] [Created] (SPARK-26115) Fix deprecated warnings related to scala 2.12 in core module
Gengliang Wang created SPARK-26115:
--
Summary: Fix deprecated warnings related to scala 2.12 in core module
Key: SPARK-26115
URL: https://issues.apache.org/jira/browse/SPARK-26115
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.0.0
Reporter: Gengliang Wang

{quote}
[warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:505: Eta-expansion of zero-argument methods is deprecated. To avoid this warning, write (() => SparkContext.this.reportHeartBeat()).
[warn] _heartbeater = new Heartbeater(env.memoryManager, reportHeartBeat, "driver-heartbeater",
[warn] ^
[warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala:273: Eta-expansion of zero-argument methods is deprecated. To avoid this warning, write (() => HadoopDelegationTokenManager.this.fileSystemsToAccess()).
[warn] val providers = Seq(new HadoopFSDelegationTokenProvider(fileSystemsToAccess)) ++
[warn] ^
[warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/executor/Executor.scala:193: Eta-expansion of zero-argument methods is deprecated. To avoid this warning, write (() => Executor.this.reportHeartBeat()).
[warn] private val heartbeater = new Heartbeater(env.memoryManager, reportHeartBeat,
[warn] ^
[warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/api/r/RBackend.scala:102: method childGroup in class ServerBootstrap is deprecated: see corresponding Javadoc for more information.
[warn] if (bootstrap != null && bootstrap.childGroup() != null) {
[warn] ^
[warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala:63: type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use java.util.concurrent.ForkJoinPool directly, instead of this alias
[warn] new ForkJoinTaskSupport(new ForkJoinPool(8))
[warn] ^
[warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/scheduler/StageInfo.scala:59: value attemptId in class StageInfo is deprecated (since 2.3.0): Use attemptNumber instead
[warn] def attemptNumber(): Int = attemptId
[warn] ^
[warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:186: type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use java.util.concurrent.ForkJoinPool directly, instead of this alias
[warn] def newForkJoinPool(prefix: String, maxThreadNumber: Int): SForkJoinPool = {
[warn] ^
[warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:189: type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use java.util.concurrent.ForkJoinPool directly, instead of this alias
[warn] override def newThread(pool: SForkJoinPool) =
[warn] ^
[warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:190: type ForkJoinWorkerThread in package forkjoin is deprecated (since 2.12.0): use java.util.concurrent.ForkJoinWorkerThread directly, instead of this alias
[warn] new SForkJoinWorkerThread(pool) {
[warn] ^
[warn] /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:194: type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use java.util.concurrent.ForkJoinPool directly, instead of this alias
[warn] new SForkJoinPool(maxThreadNumber, factory,
[warn] ^
[warn] 10 warnings found
{quote}
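The fixes the compiler asks for are mechanical. Here is a minimal sketch of both patterns, using an illustrative stand-in class rather than Spark's actual Heartbeater constructor:
{code:scala}
import java.util.concurrent.ForkJoinPool  // instead of the deprecated scala.concurrent.forkjoin alias
import scala.collection.parallel.ForkJoinTaskSupport

// Illustrative stand-in; Spark's Heartbeater takes different parameters.
class Heartbeater(reportHeartbeat: () => Unit, name: String)

object DeprecationFixes {
  private def reportHeartBeat(): Unit = println("heartbeat")

  // Deprecated in Scala 2.12 (implicit eta-expansion of a zero-arg method):
  //   new Heartbeater(reportHeartBeat, "driver-heartbeater")
  // Warning-free form suggested by the compiler: an explicit function literal.
  val heartbeater = new Heartbeater(() => reportHeartBeat(), "driver-heartbeater")

  // For the ForkJoinPool warnings, import the JDK class directly; under 2.12
  // the old alias and the JDK type are the same class, so this compiles
  // unchanged and silences the warning.
  val taskSupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
}
{code}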
[jira] [Assigned] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions
[ https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26114:
Assignee: (was: Apache Spark)

> Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions
> -
>
> Key: SPARK-26114
> URL: https://issues.apache.org/jira/browse/SPARK-26114
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.2, 2.3.2, 2.4.0
> Environment: Spark 3.0.0-SNAPSHOT (master branch)
> Scala 2.11
> Yarn 2.7
> Reporter: Sergey Zhemzhitsky
> Priority: Major
> Attachments: run1-noparams-dominator-tree-externalsorter-gc-root.png, run1-noparams-dominator-tree-externalsorter.png, run1-noparams-dominator-tree.png
>
> Trying to use _coalesce_ after shuffle-oriented transformations leads to OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X GB of Y GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead_.
> Discussion is [here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html].
> The error happens when specifying a pretty small number of partitions in the _coalesce_ call.
> *How to reproduce?*
> # Start spark-shell
> {code:bash}
> spark-shell \
>   --num-executors=5 \
>   --executor-cores=2 \
>   --master=yarn \
>   --deploy-mode=client \
>   --conf spark.executor.memory=1g \
>   --conf spark.dynamicAllocation.enabled=false \
>   --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
> {code}
> Please note the _-Dio.netty.noUnsafe=true_ property. Preventing off-heap memory usage seems to be the only way to control the amount of memory used for shuffle data transfers for now.
> Also note that the total number of cores allocated for the job is 5x2=10.
> # Then generate some test data
> {code:scala}
> import org.apache.hadoop.io._
> import org.apache.hadoop.io.compress._
> import org.apache.commons.lang._
> import org.apache.spark._
> // generate 100M records of sample data
> sc.makeRDD(1 to 1000, 1000)
>   .flatMap(item => (1 to 100000)
>     .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> new Text(RandomStringUtils.randomAlphanumeric(1024))))
>   .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec]))
> {code}
> # Run the sample job
> {code:scala}
> import org.apache.hadoop.io._
> import org.apache.spark._
> import org.apache.spark.storage._
> val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
> rdd
>   .map(item => item._1.toString -> item._2.toString)
>   .repartitionAndSortWithinPartitions(new HashPartitioner(1000))
>   .coalesce(10, false)
>   .count
> {code}
> Note that the number of partitions is equal to the total number of cores allocated to the job.
> Here is the dominator tree from the heap dump:
> !run1-noparams-dominator-tree.png|width=700!
> There are 4 instances of ExternalSorter, although there are only 2 concurrently running tasks per executor.
> !run1-noparams-dominator-tree-externalsorter.png|width=700!
> And the paths to the GC roots of the already stopped ExternalSorter:
> !run1-noparams-dominator-tree-externalsorter-gc-root.png|width=700!
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
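As a follow-up illustration only, not a fix validated in this ticket: the shuffling variant of _coalesce_ keeps the 1000-task sorted stage intact and reduces parallelism in a separate stage, at the cost of one more shuffle.
{code:scala}
import org.apache.hadoop.io._
import org.apache.spark._

// With shuffle = false (the default), each of the 10 result tasks reads 100
// of the 1000 sorted partitions in sequence, which is where the already
// stopped ExternalSorter/PartitionedPairBuffer instances in the heap dumps
// above appear to stay reachable. shuffle = true is equivalent to
// .repartition(10).
val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
rdd
  .map(item => item._1.toString -> item._2.toString)
  .repartitionAndSortWithinPartitions(new HashPartitioner(1000))
  .coalesce(10, shuffle = true)
  .count
{code}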