[jira] [Resolved] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26019.
--
Resolution: Cannot Reproduce

Reopen this if the issue persists in Apache master, or if the analysis for it is 
very clear. I vaguely suspect it's somehow related to Cloudera's parcel. If the 
use case is strong enough (I guess it is), and we're clear on what the cause is, 
switching the two lines should be fine.

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 652, in __init__
>     self.handle()
>   File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 238, in poll
>     if func():
>   File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
> {code}
>  
> The error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems the error is flaky - on the next rerun it didn't happen.






[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692778#comment-16692778
 ] 

Hyukjin Kwon commented on SPARK-26019:
--

Also, I followed the guide:

{quote}
For issues that can’t be reproduced against master as reported, resolve as 
Cannot Reproduce
{quote}

You missed this.




[jira] [Comment Edited] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692773#comment-16692773
 ] 

Hyukjin Kwon edited comment on SPARK-26019 at 11/20/18 7:29 AM:


[~Tagar], can you reproduce this in Apache master?

Also, we don't know what causes the error. I also checked what [~viirya] 
checked, and it doesn't make sense that that's the cause.


was (Author: hyukjin.kwon):
[~Tagar], can you reproduce this in Apache master?




[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692773#comment-16692773
 ] 

Hyukjin Kwon commented on SPARK-26019:
--

[~Tagar], can you reproduce this in Apache master?




[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692769#comment-16692769
 ] 

Hyukjin Kwon commented on SPARK-26019:
--

I can't reproduce the issue locally or on a YARN cluster. That's why I resolved 
this JIRA. If it's a Cloudera distribution, please use a Cloudera channel.




[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692722#comment-16692722
 ] 

Liang-Chi Hsieh commented on SPARK-26019:
-

{{TCPServer}} begins to process requests only after we call its 
{{serve_forever}} method. When the server processes a request, it instantiates 
{{RequestHandlerClass}}:

https://github.com/python/cpython/blob/2.7/Lib/SocketServer.py#L332

Since {{RequestHandlerClass}} in {{AccumulatorServer}} is 
{{_UpdateRequestHandler}}, at that moment {{_UpdateRequestHandler.handle}} is 
called and accesses {{auth_token}}.

All of the above should happen after we call {{serve_forever}} in 
{{_start_update_server}}:

{code}
def _start_update_server(auth_token):
    """Start a TCP server to receive accumulator updates in a daemon thread, and returns it"""
    server = AccumulatorServer(("localhost", 0), _UpdateRequestHandler, auth_token)
    thread = threading.Thread(target=server.serve_forever)
{code}

And I think that by then we have already finished instantiating 
{{AccumulatorServer}}, so {{self.auth_token = auth_token}} has already run.
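
A minimal, self-contained sketch of that ordering (toy names, Python 2 
{{SocketServer}} as in the traceback; not the actual pyspark code):

{code:python}
# Minimal sketch of the ordering described above (toy names, not the actual
# pyspark/accumulators.py code).
import threading
import SocketServer  # Python 2, as in the traceback


class ToyAccumulatorServer(SocketServer.TCPServer):
    def __init__(self, server_address, RequestHandlerClass, auth_token):
        # TCPServer.__init__ only binds and activates the listening socket;
        # no request is handled here.
        SocketServer.TCPServer.__init__(self, server_address, RequestHandlerClass)
        self.auth_token = auth_token


class ToyHandler(SocketServer.StreamRequestHandler):
    def handle(self):
        # Runs only once serve_forever() dispatches a request, so by then
        # self.server.auth_token should already have been assigned.
        token = self.server.auth_token
        self.rfile.read(len(token))


server = ToyAccumulatorServer(("localhost", 0), ToyHandler, "secret")
thread = threading.Thread(target=server.serve_forever)
thread.daemon = True
thread.start()
{code}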




[jira] [Created] (SPARK-26126) Should put scala-library deps into root pom instead of spark-tags module

2018-11-19 Thread liupengcheng (JIRA)
liupengcheng created SPARK-26126:


 Summary: Should put scala-library deps into root pom instead of 
spark-tags module
 Key: SPARK-26126
 URL: https://issues.apache.org/jira/browse/SPARK-26126
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0, 2.3.0, 2.1.0
Reporter: liupengcheng


While doing some backporting in our custom Spark, I noticed some strange code in 
the spark-tags module:
{code:java}
<dependencies>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
  </dependency>
</dependencies>
{code}
As far as I know, shouldn't spark-tags only contain some annotation-related 
classes or deps?

Should we put the scala-library dep into the root pom?






[jira] [Reopened] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov reopened SPARK-26019:
---




[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692716#comment-16692716
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

[~viirya] the exception stack shows that the error happened in SocketServer.py, 
in the BaseRequestHandler class constructor; excerpt from the full exception 
stack above:

{code:python}
...
  File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 652, in __init__
    self.handle()
...
{code}

Notice the constructor here calls `self.handle()` - 
https://github.com/python/cpython/blob/2.7/Lib/SocketServer.py#L655

`handle()` is defined in the derived class _UpdateRequestHandler here 
https://github.com/apache/spark/blob/master/python/pyspark/accumulators.py#L232
and expects `auth_token` to be set:
https://github.com/apache/spark/blob/master/python/pyspark/accumulators.py#L254 
- that's exactly where the exception happens.

[~irashid] was right - those two lines have to be swapped.
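
A hedged sketch of the swap being proposed (class and argument names are taken 
from the tracebacks above; the exact upstream code may differ):

{code:python}
import SocketServer  # Python 2, as in the traceback above


class AccumulatorServer(SocketServer.TCPServer):
    def __init__(self, server_address, RequestHandlerClass, auth_token):
        # Proposed order: assign the token first, then run the parent
        # constructor, so nothing that runs during or right after construction
        # can see the server without self.auth_token set.
        # (In 2.3.2 the two lines appear in the opposite order.)
        self.auth_token = auth_token
        SocketServer.TCPServer.__init__(self, server_address, RequestHandlerClass)
{code}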

[~hyukjin.kwon] it's odd that you closed this jira, although I said it always 
reproduces for me (100% of the time), and even [posted a reproducer 
here|https://issues.apache.org/jira/secure/EditComment!default.jspa?id=13197858=16692219].
[~saivarunvishal] also said it happens for him in SPARK-26113, and you closed 
that jira as well.
That seems not in line with https://spark.apache.org/contributing.html - 
"Contributing Bug Reports". Please let me know what I'm missing here.

I called out [~bersprockets] because we use the Cloudera distribution of Spark 
and Cloudera has a few patches on top of open-source Spark.
I wanted to make sure it's not Cloudera-distro specific. Also, we have worked 
with Bruce on several other Spark issues and noticed he's on the watchers list 
of this jira... Now I see that this issue is not Cloudera specific though.






[jira] [Created] (SPARK-26125) Delegation Token seems not appropriately stored on secrets of Kubernetes/Kerberized HDFS

2018-11-19 Thread Kei Kori (JIRA)
Kei Kori created SPARK-26125:


 Summary: Delegation Token seems not appropriately stored on 
secrets of Kubernetes/Kerberized HDFS
 Key: SPARK-26125
 URL: https://issues.apache.org/jira/browse/SPARK-26125
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: Kei Kori
 Attachments: spark-submit-stern.log

I tried Kerberos authentication with the Kubernetes resource manager and an 
external Hadoop and KDC.
I tested a build of 
[6c9c84f|https://github.com/apache/spark/commit/6c9c84ffb9c8d98ee2ece7ba4b010856591d383d]
 (master + SPARK-23257).

{code}
$ bin/spark-submit \
  --deploy-mode cluster \
  --class org.apache.spark.examples.HdfsTest \
  --master k8s://https://master01.node:6443 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.app.name=spark-hdfs \
  --conf spark.executer.instances=1 \
  --conf spark.kubernetes.container.image=docker-registry/kkori/spark:6c9c84f \
  --conf spark.kubernetes.kerberos.enabled=true \
  --conf spark.kubernetes.kerberos.krb5.configMapName=krb5-conf \
  --conf spark.kubernetes.kerberos.keytab=/tmp/test.keytab \
  --conf spark.kubernetes.kerberos.principal=t...@external.kerberos.realm.com \
  --conf spark.kubernetes.hadoop.configMapName=hadoop-conf \
  local:///opt/spark/examples/jars/spark-examples_2.11-3.0.0-SNAPSHOT.jar
{code}

I successfully submitted to the Kubernetes RM, and Kubernetes spawned the 
spark-driver and executors, but the Hadoop delegation token seems to be stored 
incorrectly in the Kubernetes secret, since it contains only the header, like 
below:

{code}
$ kubectl get secrets spark-hdfs-1542613661459-delegation-tokens -o jsonpath='{.data.hadoop-tokens}' | {base64 -d | cat -A; echo;}
HDTS^@^@^@
{code}

The result of "kubectl get secrets" should be like folloing(I masked the actual 
result):
{code}
HDTS^@^ha-hdfs:test^@^_t...@external.kerberos.realm.com^@^@
{code}


As a result, the spark-driver threw a GSSException on each access to HDFS.
Full logs (submit, driver, executor) are attached.
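
For reference, a rough Python equivalent of the inspection step above (a sketch 
only: it assumes kubectl is on PATH and that the secret name from this example 
matches your deployment; the expectation that a well-formed blob is the HDTS 
magic followed by token records is my assumption):

{code:python}
# Rough Python equivalent of the kubectl/base64 inspection above (assumptions:
# kubectl on PATH, secret name as in this example).
import base64
import subprocess

raw = subprocess.check_output([
    "kubectl", "get", "secret", "spark-hdfs-1542613661459-delegation-tokens",
    "-o", "jsonpath={.data.hadoop-tokens}",
])
blob = base64.b64decode(raw)
print(repr(blob[:4]))  # expected: the 'HDTS' magic header
print(len(blob))       # a healthy blob should be much longer than just the header
{code}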






[jira] [Updated] (SPARK-26125) Delegation Token seems not appropriately stored on secrets of Kubernetes/Kerberized HDFS

2018-11-19 Thread Kei Kori (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kei Kori updated SPARK-26125:
-
Attachment: spark-submit-stern.log

> Delegation Token seems not appropriately stored on secrets of 
> Kubernetes/Kerberized HDFS
> 
>
> Key: SPARK-26125
> URL: https://issues.apache.org/jira/browse/SPARK-26125
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Kei Kori
>Priority: Minor
> Attachments: spark-submit-stern.log
>
>






[jira] [Commented] (SPARK-25959) Difference in featureImportances results on computed vs saved models

2018-11-19 Thread Dagang Wei (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692567#comment-16692567
 ] 

Dagang Wei commented on SPARK-25959:


Can we backport the fix "[SPARK-25959][ML] GBTClassifier picks wrong impurity 
stats on loading" (e00cac9) to Spark 2.2+? I tried to cherry-pick it to 2.2, 
but there are 2 conflicts I don't know how to resolve correctly:

both modified: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
both modified: project/MimaExcludes.scala

> Difference in featureImportances results on computed vs saved models
> 
>
> Key: SPARK-25959
> URL: https://issues.apache.org/jira/browse/SPARK-25959
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Suraj Nayak
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 3.0.0
>
>
> I tried to implement GBT and found that the feature importances computed while 
> the model was fit are different once the same model is saved to storage and 
> loaded back. 
>  
> I also found that once the persisted model is loaded, saved back again, and 
> loaded, the feature importances remain the same. 
>  
> Not sure if it's a bug while storing and reading the model the first time, or 
> if I am missing some parameter that needs to be set before saving the model 
> (thus the model is picking some defaults - causing the feature importances to 
> change)
>  
> *Below is the test code:*
> val testDF = Seq(
> (1, 3, 2, 1, 1),
> (3, 2, 1, 2, 0),
> (2, 2, 1, 1, 0),
> (3, 4, 2, 2, 0),
> (2, 2, 1, 3, 1)
> ).toDF("a", "b", "c", "d", "e")
> val featureColumns = testDF.columns.filter(_ != "e")
> // Assemble the features into a vector
> val assembler = new 
> VectorAssembler().setInputCols(featureColumns).setOutputCol("features")
> // Transform the data to get the feature data set
> val featureDF = assembler.transform(testDF)
> // Train a GBT model.
> val gbt = new GBTClassifier()
> .setLabelCol("e")
> .setFeaturesCol("features")
> .setMaxDepth(2)
> .setMaxBins(5)
> .setMaxIter(10)
> .setSeed(10)
> .fit(featureDF)
> gbt.transform(featureDF).show(false)
> // Write out the model
> featureColumns.zip(gbt.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* Prints
> (d,0.5931875075767403)
> (a,0.3747184548362353)
> (b,0.03209403758702444)
> (c,0.0)
> */
> gbt.write.overwrite().save("file:///tmp/test123")
> println("Reading model again")
> val gbtload = GBTClassificationModel.load("file:///tmp/test123")
> featureColumns.zip(gbtload.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /*
> Prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */
> gbtload.write.overwrite().save("file:///tmp/test123_rewrite")
> val gbtload2 = GBTClassificationModel.load("file:///tmp/test123_rewrite")
> featureColumns.zip(gbtload2.featureImportances.toArray).sortBy(-_._2).take(20).foreach(println)
> /* prints
> (d,0.6455841215290767)
> (a,0.3316126797964181)
> (b,0.022803198674505094)
> (c,0.0)
> */
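
For anyone who wants to check the same behaviour from Python, here is a rough 
PySpark mirror of the Scala snippet above (a sketch only, assuming a running 
SparkSession and the stock pyspark.ml API; it is not part of the original 
report):

{code:python}
# Rough PySpark mirror of the Scala reproducer above (a sketch, not the
# original report's code).
from pyspark.ml.classification import GBTClassifier, GBTClassificationModel
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

test_df = spark.createDataFrame(
    [(1, 3, 2, 1, 1), (3, 2, 1, 2, 0), (2, 2, 1, 1, 0), (3, 4, 2, 2, 0), (2, 2, 1, 3, 1)],
    ["a", "b", "c", "d", "e"])
feature_cols = [c for c in test_df.columns if c != "e"]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
feature_df = assembler.transform(test_df)

gbt = GBTClassifier(labelCol="e", featuresCol="features",
                    maxDepth=2, maxBins=5, maxIter=10, seed=10).fit(feature_df)
print(gbt.featureImportances)

gbt.write().overwrite().save("file:///tmp/test123")
loaded = GBTClassificationModel.load("file:///tmp/test123")
# Before the fix, these importances may differ from the ones printed above.
print(loaded.featureImportances)
{code}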






[jira] [Commented] (SPARK-26077) Reserved SQL words are not escaped by JDBC writer for table name

2018-11-19 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692546#comment-16692546
 ] 

Takeshi Yamamuro commented on SPARK-26077:
--

Thanks for reporting this! It seems master doesn't quote JDBC table names now. 
Are you interested in contributing a fix for this? I think this is a kind of 
starter issue, maybe...
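
Until it is fixed, here is a hedged PySpark sketch of the reproducer plus a 
possible workaround: pre-quoting the identifier in {{dbtable}} yourself 
(backticks are MySQL-specific, and whether the connector accepts a pre-quoted 
name is an assumption, not something this ticket confirms):

{code:python}
# Hedged sketch: reproducer and a possible (unconfirmed) workaround for the
# unquoted reserved table name. Requires the MySQL connector jar on the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["order"])

# Fails while the writer emits the reserved name unquoted in CREATE TABLE:
#   .option("dbtable", "condition")

# Possible workaround: quote the identifier yourself (MySQL backticks).
(df.write
   .format("jdbc")
   .option("url", "jdbc:mysql://root@localhost:3306/test")
   .option("driver", "com.mysql.cj.jdbc.Driver")
   .option("dbtable", "`condition`")
   .save())
{code}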

> Reserved SQL words are not escaped by JDBC writer for table name
> 
>
> Key: SPARK-26077
> URL: https://issues.apache.org/jira/browse/SPARK-26077
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Eugene Golovan
>Priority: Major
>
> This bug is similar to SPARK-16387 but this time table name is not escaped.
> How to reproduce:
> 1/ Start spark shell with mysql connector
> spark-shell --jars ./mysql-connector-java-8.0.13.jar
>  
> 2/ Execute next code
>  
> import spark.implicits._
> (spark
> .createDataset(Seq("a","b","c"))
> .toDF("order")
> .write
> .format("jdbc")
> .option("url", s"jdbc:mysql://root@localhost:3306/test")
> .option("driver", "com.mysql.cj.jdbc.Driver")
> .option("dbtable", "condition")
> .save)
>  
> , where condition is a reserved word.
>  
> Error message:
>  
> java.sql.SQLSyntaxErrorException: You have an error in your SQL syntax; check 
> the manual that corresponds to your MySQL server version for the right syntax 
> to use near 'condition (`order` TEXT )' at line 1
>  at 
> com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:120)
>  at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97)
>  at 
> com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
>  at 
> com.mysql.cj.jdbc.StatementImpl.executeUpdateInternal(StatementImpl.java:1355)
>  at 
> com.mysql.cj.jdbc.StatementImpl.executeLargeUpdate(StatementImpl.java:2128)
>  at com.mysql.cj.jdbc.StatementImpl.executeUpdate(StatementImpl.java:1264)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:844)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
>  ... 59 elided
>  
>  
>  






[jira] [Assigned] (SPARK-26124) Update plugins, including MiMa

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26124:


Assignee: Apache Spark  (was: Sean Owen)

> Update plugins, including MiMa
> --
>
> Key: SPARK-26124
> URL: https://issues.apache.org/jira/browse/SPARK-26124
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>
> For Spark 3, we should update plugins to their latest version where possible, 
> to pick up miscellaneous fixes. In particular we can update MiMa to 0.3.0, 
> though that introduces some new errors on old changes due to some changes in 
> MiMa.
> Most SBT plugins can't really be updated further without updating to SBT 1.x, 
> and that will require some changes to the build, and it generally seems like 
> all of these new versions are for Scala 2.12+, including the new zinc. That 
> will probably be a bigger change but only after deciding to drop Scala 2.11 
> support.






[jira] [Assigned] (SPARK-26124) Update plugins, including MiMa

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26124:


Assignee: Sean Owen  (was: Apache Spark)




[jira] [Commented] (SPARK-26124) Update plugins, including MiMa

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692539#comment-16692539
 ] 

Apache Spark commented on SPARK-26124:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/23087




[jira] [Created] (SPARK-26124) Update plugins, including MiMa

2018-11-19 Thread Sean Owen (JIRA)
Sean Owen created SPARK-26124:
-

 Summary: Update plugins, including MiMa
 Key: SPARK-26124
 URL: https://issues.apache.org/jira/browse/SPARK-26124
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Sean Owen
Assignee: Sean Owen


For Spark 3, we should update plugins to their latest version where possible, 
to pick up miscellaneous fixes. In particular we can update MiMa to 0.3.0, 
though that introduces some new errors on old changes due to some changes in 
MiMa.

Most SBT plugins can't really be updated further without updating to SBT 1.x, 
and that will require some changes to the build, and it generally seems like 
all of these new versions are for Scala 2.12+, including the new zinc. That 
will probably be a bigger change but only after deciding to drop Scala 2.11 
support.






[jira] [Resolved] (SPARK-26107) Extend ReplaceNullWithFalseInPredicate to support higher-order functions: ArrayExists, ArrayFilter, MapFilter

2018-11-19 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26107.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23079
[https://github.com/apache/spark/pull/23079]

> Extend ReplaceNullWithFalseInPredicate to support higher-order functions: 
> ArrayExists, ArrayFilter, MapFilter
> -
>
> Key: SPARK-26107
> URL: https://issues.apache.org/jira/browse/SPARK-26107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kris Mok
>Assignee: Kris Mok
>Priority: Major
> Fix For: 3.0.0
>
>
> Extend the {{ReplaceNullWithFalse}} optimizer rule introduced in SPARK-25860 
> to also support optimizing predicates in higher-order functions of 
> {{ArrayExists}}, {{ArrayFilter}}, {{MapFilter}}.
> Also rename the rule to {{ReplaceNullWithFalseInPredicate}} to better reflect 
> its intent.
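
As a small illustration of the kind of predicate this touches (a sketch; the 
exact rewrite the rule performs is an assumption based on the description 
above): for {{filter}}, an element whose lambda evaluates to NULL is dropped 
exactly as if it were FALSE, which is what makes a null-to-false substitution 
safe there.

{code:python}
# Sketch: a higher-order-function predicate that can evaluate to NULL.
# filter() drops elements whose predicate is NULL just like FALSE, which is
# why (per the description above) NULL can be rewritten to FALSE here.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql(
    "SELECT filter(array(1, 2, 3, 4), x -> IF(x > 2, CAST(NULL AS BOOLEAN), TRUE)) AS kept"
).show()
# Expected: kept = [1, 2]
{code}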






[jira] [Assigned] (SPARK-26107) Extend ReplaceNullWithFalseInPredicate to support higher-order functions: ArrayExists, ArrayFilter, MapFilter

2018-11-19 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26107:
---

Assignee: Kris Mok




[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692504#comment-16692504
 ] 

Liang-Chi Hsieh commented on SPARK-26019:
-

Doesn't the server begin to handle requests only after {{serve_forever}} is 
called? And that is called after {{AccumulatorServer}} initialization.




[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692498#comment-16692498
 ] 

Hyukjin Kwon commented on SPARK-26019:
--

Please reopen if this is still reproducible, or if there's any analysis of why 
this happens. Swapping the two lines should be fine, but I don't understand why 
and how.




[jira] [Updated] (SPARK-25919) Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned

2018-11-19 Thread Pawan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pawan updated SPARK-25919:
--
Affects Version/s: 2.1.0

> Date value corrupts when tables are "ParquetHiveSerDe" formatted and target 
> table is Partitioned
> 
>
> Key: SPARK-25919
> URL: https://issues.apache.org/jira/browse/SPARK-25919
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 2.1.0, 2.2.1
>Reporter: Pawan
>Priority: Major
>
> Hi
> I found a really strange issue. Below are the steps to reproduce it. This 
> issue occurs only when the table row format is ParquetHiveSerDe and the 
> target table is Partitioned
> *Hive:*
> Log in to the Hive terminal on the cluster and create the tables below.
> {code:java}
> create table t_src(
> name varchar(10),
> dob timestamp
> )
> ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> create table t_tgt(
> name varchar(10),
> dob timestamp
> )
> PARTITIONED BY (city varchar(10))
> ROW FORMAT SERDE 
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
> STORED AS INPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
> {code}
> Insert data into the source table (t_src)
> {code:java}
> INSERT INTO t_src VALUES ('p1', '0001-01-01 00:00:00.0'),('p2', '0002-01-01 
> 00:00:00.0'), ('p3', '0003-01-01 00:00:00.0'),('p4', '0004-01-01 
> 00:00:00.0');{code}
> *Spark-shell:*
> Get on to spark-shell. 
> Execute below commands on spark shell:
> {code:java}
> import org.apache.spark.sql.hive.HiveContext
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> val q0 = "TRUNCATE table t_tgt"
> val q1 = "SELECT CAST(alias.name AS STRING) as a0, alias.dob as a1 FROM 
> DEFAULT.t_src alias"
> val q2 = "INSERT INTO TABLE DEFAULT.t_tgt PARTITION (city) SELECT tbl0.a0 as 
> c0, tbl0.a1 as c1, NULL as c2 FROM tbl0"
> sqlContext.sql(q0)
> sqlContext.sql(q1).select("a0","a1").createOrReplaceTempView("tbl0")
> sqlContext.sql(q2)
> {code}
>  After this, check the contents of the target table t_tgt. You will see the 
> date "0001-01-01 00:00:00" changed to "0002-01-01 00:00:00". The snippets 
> below show the contents of both tables:
> {code:java}
> select * from t_src;
> +-------------+------------------------+
> | t_src.name  | t_src.dob              |
> +-------------+------------------------+
> | p1          | 0001-01-01 00:00:00.0  |
> | p2          | 0002-01-01 00:00:00.0  |
> | p3          | 0003-01-01 00:00:00.0  |
> | p4          | 0004-01-01 00:00:00.0  |
> +-------------+------------------------+
>
> select * from t_tgt;
> +-------------+------------------------+-------------+
> | t_src.name  | t_src.dob              | t_tgt.city  |
> +-------------+------------------------+-------------+
> | p1          | 0002-01-01 00:00:00.0  | __HIVE_DEF  |
> | p2          | 0002-01-01 00:00:00.0  | __HIVE_DEF  |
> | p3          | 0003-01-01 00:00:00.0  | __HIVE_DEF  |
> | p4          | 0004-01-01 00:00:00.0  | __HIVE_DEF  |
> +-------------+------------------------+-------------+
> {code}
>  
> Is this a known issue? Is it fixed in any subsequent releases?
> Thanks & regards,
> Pawan Lawale






[jira] [Resolved] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26019.
--
Resolution: Cannot Reproduce

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems error is flaky - on next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692495#comment-16692495
 ] 

Hyukjin Kwon commented on SPARK-26019:
--

I don't understand how swapping the two lines solves this issue. Also, I don't 
get how it happens or how to reproduce it.
[~Tagar], you reported the issue, and it looks odd to ask an arbitrary person 
in Apache JIRA for an idea on how to test it.

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems error is flaky - on next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26123) Make all MLlib/ML Models public

2018-11-19 Thread Nicholas Resnick (JIRA)
Nicholas Resnick created SPARK-26123:


 Summary: Make all MLlib/ML Models public
 Key: SPARK-26123
 URL: https://issues.apache.org/jira/browse/SPARK-26123
 Project: Spark
  Issue Type: Wish
  Components: ML, MLlib
Affects Versions: 2.4.0
Reporter: Nicholas Resnick


Can all Model subclasses be made public? It's very difficult to make custom 
Models that build off of MLlib/ML Models otherwise. I also can't think of a 
reason why Estimators should be public, but not Models.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26094) Streaming WAL should create parent dirs

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692382#comment-16692382
 ] 

Apache Spark commented on SPARK-26094:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/23092

> Streaming WAL should create parent dirs
> ---
>
> Key: SPARK-26094
> URL: https://issues.apache.org/jira/browse/SPARK-26094
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Blocker
>
> SPARK-25871 introduced a regression in the streaming WAL -- it no longer 
> makes all the parent dirs, so you may see an exception like this in cases 
> that used to work:
> {noformat}
> 18/11/09 03:31:48 ERROR util.FileBasedWriteAheadLog_ReceiverSupervisorImpl: 
> Failed to write to write ahead log after 3 failures
> ...
> org.apache.spark.SparkException: Exception thrown in awaitResult:
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
> at 
> org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:210)
> ...
> Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: 
> /tmp/__spark__1e8ba184-d323-47eb-b857-0e6285409424/88992/checkpoints/receivedData/0
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyParentDir(FSDirectory.java:1923)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26094) Streaming WAL should create parent dirs

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26094:


Assignee: Apache Spark  (was: Imran Rashid)

> Streaming WAL should create parent dirs
> ---
>
> Key: SPARK-26094
> URL: https://issues.apache.org/jira/browse/SPARK-26094
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: Apache Spark
>Priority: Blocker
>
> SPARK-25871 introduced a regression in the streaming WAL -- it no longer 
> makes all the parent dirs, so you may see an exception like this in cases 
> that used to work:
> {noformat}
> 18/11/09 03:31:48 ERROR util.FileBasedWriteAheadLog_ReceiverSupervisorImpl: 
> Failed to write to write ahead log after 3 failures
> ...
> org.apache.spark.SparkException: Exception thrown in awaitResult:
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
> at 
> org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:210)
> ...
> Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: 
> /tmp/__spark__1e8ba184-d323-47eb-b857-0e6285409424/88992/checkpoints/receivedData/0
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyParentDir(FSDirectory.java:1923)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26094) Streaming WAL should create parent dirs

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692380#comment-16692380
 ] 

Apache Spark commented on SPARK-26094:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/23092

> Streaming WAL should create parent dirs
> ---
>
> Key: SPARK-26094
> URL: https://issues.apache.org/jira/browse/SPARK-26094
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Blocker
>
> SPARK-25871 introduced a regression in the streaming WAL -- it no longer 
> makes all the parent dirs, so you may see an exception like this in cases 
> that used to work:
> {noformat}
> 18/11/09 03:31:48 ERROR util.FileBasedWriteAheadLog_ReceiverSupervisorImpl: 
> Failed to write to write ahead log after 3 failures
> ...
> org.apache.spark.SparkException: Exception thrown in awaitResult:
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
> at 
> org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:210)
> ...
> Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: 
> /tmp/__spark__1e8ba184-d323-47eb-b857-0e6285409424/88992/checkpoints/receivedData/0
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyParentDir(FSDirectory.java:1923)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26094) Streaming WAL should create parent dirs

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26094:


Assignee: Imran Rashid  (was: Apache Spark)

> Streaming WAL should create parent dirs
> ---
>
> Key: SPARK-26094
> URL: https://issues.apache.org/jira/browse/SPARK-26094
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Blocker
>
> SPARK-25871 introduced a regression in the streaming WAL -- it no longer 
> makes all the parent dirs, so you may see an exception like this in cases 
> that used to work:
> {noformat}
> 18/11/09 03:31:48 ERROR util.FileBasedWriteAheadLog_ReceiverSupervisorImpl: 
> Failed to write to write ahead log after 3 failures
> ...
> org.apache.spark.SparkException: Exception thrown in awaitResult:
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
> at 
> org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:210)
> ...
> Caused by: java.io.FileNotFoundException: Parent directory doesn't exist: 
> /tmp/__spark__1e8ba184-d323-47eb-b857-0e6285409424/88992/checkpoints/receivedData/0
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyParentDir(FSDirectory.java:1923)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692377#comment-16692377
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

[~irashid] thanks a lot for looking at this!
It makes sense to swap those two lines so that the parent class constructor is 
called after auth_token has been initialized.

We're using Cloudera's Spark, and the pyspark dependencies are inside a zip 
file, in an "immutable" parcel...
Unfortunately there is no quick way to test it, as the change has to be 
propagated to all worker nodes. [~bersprockets] any ideas on how to test this?

Thank you.
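
For reference, a minimal sketch of the swap being discussed, assuming the AccumulatorServer constructor layout at the linked lines of pyspark/accumulators.py (illustrative only, not a committed fix):

{code:python}
import SocketServer  # stdlib; named 'socketserver' on Python 3


class AccumulatorServer(SocketServer.TCPServer):
    def __init__(self, server_address, RequestHandlerClass, auth_token):
        # Set the token *before* the parent constructor binds/activates the
        # socket, so that (per the reasoning in the comments above) a request
        # handled during start-up can no longer observe auth_token == None,
        # which is what raises "object of type 'NoneType' has no len()".
        self.auth_token = auth_token
        SocketServer.TCPServer.__init__(self, server_address, RequestHandlerClass)
{code}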


> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems error is flaky - on next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692355#comment-16692355
 ] 

Imran Rashid commented on SPARK-26019:
--

[~hyukjin.kwon] you think maybe these two lines need to be swapped?

https://github.com/apache/spark/blob/master/python/pyspark/accumulators.py#L274-L275

any chance you can try with that change, [~Tagar], since you've got a handy 
environment to reproduce it?

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems error is flaky - on next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26122) Support encoding for multiLine in CSV datasource

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26122:


Assignee: (was: Apache Spark)

> Support encoding for multiLine in CSV datasource
> 
>
> Key: SPARK-26122
> URL: https://issues.apache.org/jira/browse/SPARK-26122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, CSV datasource is not able to read CSV files in different encoding 
> when multiLine is enabled. The ticket aims to support the encoding CSV 
> options in the mode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26122) Support encoding for multiLine in CSV datasource

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692306#comment-16692306
 ] 

Apache Spark commented on SPARK-26122:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/23091

> Support encoding for multiLine in CSV datasource
> 
>
> Key: SPARK-26122
> URL: https://issues.apache.org/jira/browse/SPARK-26122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, CSV datasource is not able to read CSV files in different encoding 
> when multiLine is enabled. The ticket aims to support the encoding CSV 
> options in the mode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26122) Support encoding for multiLine in CSV datasource

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692305#comment-16692305
 ] 

Apache Spark commented on SPARK-26122:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/23091

> Support encoding for multiLine in CSV datasource
> 
>
> Key: SPARK-26122
> URL: https://issues.apache.org/jira/browse/SPARK-26122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, CSV datasource is not able to read CSV files in different encoding 
> when multiLine is enabled. The ticket aims to support the encoding CSV 
> options in the mode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26122) Support encoding for multiLine in CSV datasource

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26122:


Assignee: Apache Spark

> Support encoding for multiLine in CSV datasource
> 
>
> Key: SPARK-26122
> URL: https://issues.apache.org/jira/browse/SPARK-26122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Currently, CSV datasource is not able to read CSV files in different encoding 
> when multiLine is enabled. The ticket aims to support the encoding CSV 
> options in the mode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26122) Support encoding for multiLine in CSV datasource

2018-11-19 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26122:
--

 Summary: Support encoding for multiLine in CSV datasource
 Key: SPARK-26122
 URL: https://issues.apache.org/jira/browse/SPARK-26122
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


Currently, the CSV datasource is not able to read CSV files in a different 
encoding when multiLine is enabled. This ticket aims to support the encoding 
CSV option in that mode as well.
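
For context, a minimal PySpark sketch of the read this ticket targets; both option names already exist on the CSV reader, and it is their combination with multiLine that does not currently honor the encoding (file path and charset below are placeholders):

{code:python}
# Sketch: read a non-UTF-8, multi-line CSV; "/path/to/file.csv" and "UTF-16"
# are placeholder values for illustration.
df = (spark.read
      .option("multiLine", True)
      .option("encoding", "UTF-16")
      .option("header", True)
      .csv("/path/to/file.csv"))
df.show()
{code}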



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26118:


Assignee: Apache Spark

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Apache Spark
>Priority: Major
>
> For long authorization fields the request header size could be over the 
> default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request 
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user 
> groups.
> The HTTP request to the server contains the Kerberos token in the 
> WWW-Authenticate header. The header size increases together with the number 
> of user groups. 
> Currently there is no way in Spark to override this limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692248#comment-16692248
 ] 

Apache Spark commented on SPARK-26118:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/23090

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> For long authorization fields the request header size could be over the 
> default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request 
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user 
> groups.
> The HTTP request to the server contains the Kerberos token in the 
> WWW-Authenticate header. The header size increases together with the number 
> of user groups. 
> Currently there is no way in Spark to override this limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692246#comment-16692246
 ] 

Apache Spark commented on SPARK-26118:
--

User 'attilapiros' has created a pull request for this issue:
https://github.com/apache/spark/pull/23090

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> For long authorization fields the request header size could be over the 
> default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request 
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user 
> groups.
> The HTTP request to the server contains the Kerberos token in the 
> WWW-Authenticate header. The header size increases together with the number 
> of user groups. 
> Currently there is no way in Spark to override this limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26118:


Assignee: (was: Apache Spark)

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> For long authorization fields the request header size could be over the 
> default limit (8192 bytes) and in this case Jetty replies HTTP 413 (Request 
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user 
> groups.
> The HTTP request to the server contains the Kerberos token in the 
> WWW-Authenticate header. The header size increases together with the number 
> of user groups. 
> Currently there is no way in Spark to override this limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692237#comment-16692237
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

cc [~lucacanali] 

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems error is flaky - on next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692233#comment-16692233
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

Sorry, nope, it was broken by this change:
https://github.com/apache/spark/commit/15fc2372269159ea2556b028d4eb8860c4108650#diff-c3339bbf2b850b79445b41e9eecf57c4R249
 



> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems error is flaky - on next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692226#comment-16692226
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

Might be broken by https://github.com/apache/spark/pull/22635 change 

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems error is flaky - on next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692219#comment-16692219
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

[~hyukjin.kwon] today I reproduced this for the first time, but we still 
receive reports from other users of ours as well.

Here's the code, on Spark 2.3.2 + Python 2.7.15.

Execute it on a freshly created Spark session:

{code:python}

def python_major_version():
    import sys
    return sys.version_info[0]


print(python_major_version())

# error happens here!
print(sc.parallelize([1]).map(lambda x: python_major_version()).collect())

{code}

It always reproduces for me.

Notice that just rerunning the same code makes this error disappear.



> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems error is flaky - on next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-19 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov reopened SPARK-26019:
---

Reproduced myself 

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems error is flaky - on next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26121) [Structured Streaming] Allow users to define prefix of Kafka's consumer group (group.id)

2018-11-19 Thread Anastasios Zouzias (JIRA)
Anastasios Zouzias created SPARK-26121:
--

 Summary: [Structured Streaming] Allow users to define prefix of 
Kafka's consumer group (group.id)
 Key: SPARK-26121
 URL: https://issues.apache.org/jira/browse/SPARK-26121
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Anastasios Zouzias


I ran into the following situation with Spark Structured Streaming (SS) using 
Kafka.
 
In a project that I work on, there is already a secured Kafka setup where ops 
can issue an SSL certificate per group.id, which has to be predefined (or at 
least its prefix has to be predefined).
 
On the other hand, Spark SS fixes the group.id to
 
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"
 
see, e.g.,
 
[https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L124]
 
https://github.com/apache/spark/blob/v2.4.0/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala#L81
 
I guess Spark developers had a good reason to fix it, but is it possible to 
make the prefix of the above uniqueGroupId ("spark-kafka-source") configurable?
 
The rationale is that Spark users would then not be forced to use the same 
certificate for all group ids of the form spark-kafka-source-*.
 
DoD:
* Allow Spark SS users to define the group.id prefix as an input parameter (a 
sketch of how this could look follows below).
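
A minimal PySpark sketch of what the requested knob could look like; the option name groupIdPrefix is hypothetical here and only illustrates the DoD above (today the prefix is hard-coded to "spark-kafka-source"):

{code:python}
# "groupIdPrefix" is the hypothetical option this ticket asks for; broker,
# topic, and prefix values below are placeholders.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .option("groupIdPrefix", "my-certified-prefix")  # hypothetical option
      .load())
{code}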



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26120:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Fix a streaming query leak in Structured Streaming R tests
> --
>
> Key: SPARK-26120
> URL: https://issues.apache.org/jira/browse/SPARK-26120
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, Structured Streaming, Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> "Specify a schema by using a DDL-formatted string when reading" doesn't stop 
> the streaming query before stopping Spark. It causes the following annoying 
> logs.
> {code}
> Exception in thread "stream execution thread for [id = 
> 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = 
> 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution 
> thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = 
> 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: 
> Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests

2018-11-19 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-26120:


 Summary: Fix a streaming query leak in Structured Streaming R tests
 Key: SPARK-26120
 URL: https://issues.apache.org/jira/browse/SPARK-26120
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 2.4.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


"Specify a schema by using a DDL-formatted string when reading" doesn't stop 
the streaming query before stopping Spark. It causes the following annoying 
logs.

{code}
Exception in thread "stream execution thread for [id = 
186dad10-e87f-4155-8119-00e0e63bbc1a, runId = 
2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution 
thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = 
644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: 
Exception thrown in awaitResult: 
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
at 
org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
at 
org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
at 
org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
at 
org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
... 7 more
org.apache.spark.SparkException: Exception thrown in awaitResult: 
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
at 
org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
at 
org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
at 
org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
at 
org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
... 7 more
{code}
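
The test fix itself is in SparkR, but the ordering it enforces is the same in any binding; a minimal PySpark sketch of the intended pattern (stop the query before stopping Spark):

{code:python}
# Sketch only: stop the streaming query before stopping the session, so no
# query thread outlives the RpcEnv, which is what produces the logs above.
query = (spark.readStream.format("rate").load()
         .writeStream.format("console").start())
try:
    query.processAllAvailable()
finally:
    query.stop()   # stop the query first ...
    spark.stop()   # ... then stop Spark
{code}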



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692178#comment-16692178
 ] 

Apache Spark commented on SPARK-26120:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/23089

> Fix a streaming query leak in Structured Streaming R tests
> --
>
> Key: SPARK-26120
> URL: https://issues.apache.org/jira/browse/SPARK-26120
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, Structured Streaming, Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> "Specify a schema by using a DDL-formatted string when reading" doesn't stop 
> the streaming query before stopping Spark. It causes the following annoying 
> logs.
> {code}
> Exception in thread "stream execution thread for [id = 
> 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = 
> 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution 
> thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = 
> 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: 
> Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> {code}
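
The missing clean-up is small; a minimal sketch in Scala (the test itself is R, and `spark` below stands in for the test's SparkSession) is to stop every active query before tearing the session down, so the StateStoreCoordinator endpoint is still alive when the queries terminate:

{code:scala}
// Sketch only: stop all active streaming queries before stopping Spark,
// so query termination does not race the already-stopped RpcEnv.
import org.apache.spark.sql.SparkSession

def shutdownCleanly(spark: SparkSession): Unit = {
  spark.streams.active.foreach(_.stop())  // stop each running streaming query first
  spark.stop()                            // only then tear down the session
}
{code}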



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-26119) Task metrics summary in the stage page should contain only successful tasks metrics

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26119:


Assignee: (was: Apache Spark)

> Task metrics summary in the stage page should contain only successful tasks 
> metrics
> ---
>
> Key: SPARK-26119
> URL: https://issues.apache.org/jira/browse/SPARK-26119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> Currently the task metrics summary table in the stage tab shows a summary 
> computed over all tasks. Instead, it should display the summary of only 
> succeeded tasks in the task summary metrics table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692176#comment-16692176
 ] 

Apache Spark commented on SPARK-26120:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/23089

> Fix a streaming query leak in Structured Streaming R tests
> --
>
> Key: SPARK-26120
> URL: https://issues.apache.org/jira/browse/SPARK-26120
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, Structured Streaming, Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> "Specify a schema by using a DDL-formatted string when reading" doesn't stop 
> the streaming query before stopping Spark. It causes the following annoying 
> logs.
> {code}
> Exception in thread "stream execution thread for [id = 
> 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = 
> 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution 
> thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = 
> 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: 
> Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26120:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Fix a streaming query leak in Structured Streaming R tests
> --
>
> Key: SPARK-26120
> URL: https://issues.apache.org/jira/browse/SPARK-26120
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, Structured Streaming, Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Minor
>
> "Specify a schema by using a DDL-formatted string when reading" doesn't stop 
> the streaming query before stopping Spark. It causes the following annoying 
> logs.
> {code}
> Exception in thread "stream execution thread for [id = 
> 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = 
> 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution 
> thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = 
> 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: 
> Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26119) Task metrics summary in the stage page should contain only successful tasks metrics

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26119:


Assignee: Apache Spark

> Task metrics summary in the stage page should contain only successful tasks 
> metrics
> ---
>
> Key: SPARK-26119
> URL: https://issues.apache.org/jira/browse/SPARK-26119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: Apache Spark
>Priority: Major
>
> Currently the task metrics summary table in the stage tab shows a summary 
> computed over all tasks. Instead, it should display the summary of only 
> succeeded tasks in the task summary metrics table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26119) Task metrics summary in the stage page should contain only successful tasks metrics

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692175#comment-16692175
 ] 

Apache Spark commented on SPARK-26119:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/23088

> Task metrics summary in the stage page should contain only successful tasks 
> metrics
> ---
>
> Key: SPARK-26119
> URL: https://issues.apache.org/jira/browse/SPARK-26119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> Currently the task metrics summary table in the stage tab shows a summary 
> computed over all tasks. Instead, it should display the summary of only 
> succeeded tasks in the task summary metrics table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests

2018-11-19 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26120:
-
Priority: Minor  (was: Major)

> Fix a streaming query leak in Structured Streaming R tests
> --
>
> Key: SPARK-26120
> URL: https://issues.apache.org/jira/browse/SPARK-26120
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, Structured Streaming, Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>
> "Specify a schema by using a DDL-formatted string when reading" doesn't stop 
> the streaming query before stopping Spark. It causes the following annoying 
> logs.
> {code}
> Exception in thread "stream execution thread for [id = 
> 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = 
> 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution 
> thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = 
> 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: 
> Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26120) Fix a streaming query leak in Structured Streaming R tests

2018-11-19 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-26120:
-
Component/s: Structured Streaming
 SparkR

> Fix a streaming query leak in Structured Streaming R tests
> --
>
> Key: SPARK-26120
> URL: https://issues.apache.org/jira/browse/SPARK-26120
> Project: Spark
>  Issue Type: Test
>  Components: SparkR, Structured Streaming, Tests
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> "Specify a schema by using a DDL-formatted string when reading" doesn't stop 
> the streaming query before stopping Spark. It causes the following annoying 
> logs.
> {code}
> Exception in thread "stream execution thread for [id = 
> 186dad10-e87f-4155-8119-00e0e63bbc1a, runId = 
> 2c0cc158-410b-442f-ac36-20f80ec429b1]" Exception in thread "stream execution 
> thread for people3 [id = ffa6136d-fe7b-4777-aa47-b0cb64d07ea4, runId = 
> 644b888e-9cce-4a09-bb5e-2fb122796c19]" org.apache.spark.SparkException: 
> Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:399)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:342)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:323)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:204)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 7 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26119) Task metrics summary in the stage page should contain only successful tasks metrics

2018-11-19 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692093#comment-16692093
 ] 

shahid commented on SPARK-26119:


Thanks. I am working on it

> Task metrics summary in the stage page should contain only successful tasks 
> metrics
> ---
>
> Key: SPARK-26119
> URL: https://issues.apache.org/jira/browse/SPARK-26119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.2
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> Currently the task metrics summary table in the stage tab shows a summary 
> computed over all tasks. Instead, it should display the summary of only 
> succeeded tasks in the task summary metrics table.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26119) Task metrics summary in the stage page should contain only successful tasks metrics

2018-11-19 Thread ABHISHEK KUMAR GUPTA (JIRA)
ABHISHEK KUMAR GUPTA created SPARK-26119:


 Summary: Task metrics summary in the stage page should contain 
only successful tasks metrics
 Key: SPARK-26119
 URL: https://issues.apache.org/jira/browse/SPARK-26119
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 2.3.2
Reporter: ABHISHEK KUMAR GUPTA


Currently the task metrics summary table in the stage tab shows a summary computed 
over all tasks. Instead, it should display the summary of only succeeded tasks in 
the task summary metrics table.
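
As an illustration of the intended change (not Spark's internal UI code; the row shape below is hypothetical), the summary quantiles would be computed only over tasks whose status is SUCCESS:

{code:scala}
// Hypothetical task row; the point is the filter on status before summarising.
case class TaskRow(status: String, durationMs: Long)

def durationQuantiles(tasks: Seq[TaskRow]): Seq[Double] = {
  val ok = tasks.filter(_.status == "SUCCESS").map(_.durationMs.toDouble).sorted
  if (ok.isEmpty) Seq.empty
  else Seq(0.0, 0.25, 0.5, 0.75, 1.0).map(q => ok(((ok.size - 1) * q).toInt))
}
{code}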



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-19 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692018#comment-16692018
 ] 

Attila Zsolt Piros commented on SPARK-26118:


I am working on a PR.

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> For long authorization fields the request header size can exceed the 
> default limit (8192 bytes), in which case Jetty replies with HTTP 413 (Request 
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user 
> groups.
> The HTTP request to the server contains the Kerberos token in the 
> WWW-Authenticate header. The header size increases together with the number 
> of user groups. 
> Currently there is no way in Spark to override this limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-19 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-26118:
---
Affects Version/s: (was: 2.3.2)
   (was: 2.4.0)
   (was: 2.2.2)
   (was: 2.1.3)

> Make Jetty's requestHeaderSize configurable in Spark
> 
>
> Key: SPARK-26118
> URL: https://issues.apache.org/jira/browse/SPARK-26118
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> For long authorization fields the request header size can exceed the 
> default limit (8192 bytes), in which case Jetty replies with HTTP 413 (Request 
> Entity Too Large).
> This issue may occur if the user is a member of many Active Directory user 
> groups.
> The HTTP request to the server contains the Kerberos token in the 
> WWW-Authenticate header. The header size increases together with the number 
> of user groups. 
> Currently there is no way in Spark to override this limit.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26118) Make Jetty's requestHeaderSize configurable in Spark

2018-11-19 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-26118:
--

 Summary: Make Jetty's requestHeaderSize configurable in Spark
 Key: SPARK-26118
 URL: https://issues.apache.org/jira/browse/SPARK-26118
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.4.0, 2.3.2, 2.2.2, 2.1.3, 3.0.0
Reporter: Attila Zsolt Piros


For long authorization fields the request header size can exceed the default 
limit (8192 bytes), in which case Jetty replies with HTTP 413 (Request Entity Too 
Large).

This issue may occur if the user is a member of many Active Directory user 
groups.

The HTTP request to the server contains the Kerberos token in the 
WWW-Authenticate header. The header size increases together with the number of 
user groups. 

Currently there is no way in Spark to override this limit.
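
For reference, the knob lives in Jetty's HttpConfiguration; a sketch of wiring it up follows (the spark.ui.requestHeaderSize name is only a placeholder, not the final configuration key):

{code:scala}
// Sketch: build a Jetty connector whose request header limit comes from a Spark setting.
import org.eclipse.jetty.server.{HttpConfiguration, HttpConnectionFactory, Server, ServerConnector}

val requestHeaderSizeBytes = 16 * 1024          // would be read from e.g. spark.ui.requestHeaderSize

val httpConfig = new HttpConfiguration()
httpConfig.setRequestHeaderSize(requestHeaderSizeBytes)  // Jetty's default is 8192 bytes

val server = new Server()
val connector = new ServerConnector(server, new HttpConnectionFactory(httpConfig))
connector.setPort(4040)
server.addConnector(connector)
server.start()
{code}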



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26026) Published Scaladoc jars missing from Maven Central

2018-11-19 Thread Long Cao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691965#comment-16691965
 ] 

Long Cao commented on SPARK-26026:
--

[~srowen] thanks for the expedient resolution! It's a small thing to me but 
nice to see.

> Published Scaladoc jars missing from Maven Central
> --
>
> Key: SPARK-26026
> URL: https://issues.apache.org/jira/browse/SPARK-26026
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Long Cao
>Assignee: Sean Owen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> For 2.3.x and beyond, it appears that published *-javadoc.jars are missing.
> For concrete examples:
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
> After some searching, I'm venturing a guess that [this 
> commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033]
>  removed packaging Scaladoc with the rest of the distribution.
> I don't think it's a huge problem since the versioned Scaladocs are hosted on 
> apache.org, but I use an external documentation/search tool 
> ([Dash|https://kapeli.com/dash]) that operates by looking up published 
> javadoc jars and it'd be nice to have these available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23886) update query.status

2018-11-19 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691881#comment-16691881
 ] 

Gabor Somogyi commented on SPARK-23886:
---

Thanks [~efim.poberezkin]! May I ask you then to close your PRs for this Jira 
and for SPARK-24063?

> update query.status
> ---
>
> Key: SPARK-23886
> URL: https://issues.apache.org/jira/browse/SPARK-23886
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26090) Resolve most miscellaneous deprecation and build warnings for Spark 3

2018-11-19 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26090.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23065
[https://github.com/apache/spark/pull/23065]

> Resolve most miscellaneous deprecation and build warnings for Spark 3
> -
>
> Key: SPARK-26090
> URL: https://issues.apache.org/jira/browse/SPARK-26090
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> The build has a lot of deprecation warnings. Some are new in Scala 2.12 and 
> Java 11. We've fixed some, but I wanted to take a pass at fixing lots of easy 
> miscellaneous ones here.
> They're too numerous and small to list here; see the pull request. Some 
> highlights:
> - @BeanInfo is deprecated in 2.12, and BeanInfo classes are pretty ancient in 
> Java. Instead, case classes can explicitly declare getters
> - Eta expansion of zero-arg methods; foo() becomes () => foo() in many cases
> - Floating-point Range is inexact and deprecated, like 0.0 to 100.0 by 1.0
> - finalize() is finally deprecated (just needs to be suppressed)
> - StageInfo.attemptId was deprecated and easiest to remove here
> I'm not going to touch some chunks of deprecation warnings for now:
> - Parquet deprecations
> - Hive deprecations (particularly serde2 classes)
> - Deprecations in generated code (mostly Thriftserver CLI)
> - ProcessingTime deprecations (we may need to revive this class as internal)
> - many MLlib deprecations because they concern methods that may be removed 
> anyway
> - a few Kinesis deprecations I couldn't figure out
> - Mesos get/setRole, which I don't know well
> - Kafka/ZK deprecations (e.g. poll())
> - Kinesis
> - a few other ones that will probably resolve by deleting a deprecated method
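
Two of the mechanical rewrites mentioned above, sketched standalone (these are illustrative examples, not lines from the PR):

{code:scala}
// 1. Eta-expansion of zero-arg methods: make the function literal explicit.
def heartbeat(): Unit = println("tick")
val tick: () => Unit = () => heartbeat()   // instead of relying on deprecated auto-expansion

// 2. Instead of the deprecated @BeanInfo, a case class can declare bean-style getters itself.
case class Point(x: Int, y: Int) {
  def getX: Int = x
  def getY: Int = y
}
{code}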



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version

2018-11-19 Thread Sunil Rangwani (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691828#comment-16691828
 ] 

Sunil Rangwani commented on SPARK-14492:


[~ottomata] you are using a CDH version I believe. I don't know what 
dependencies their parcels include.

This bug was originally found in the open source version built using maven as 
described here 
[https://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support]
 

 

> Spark SQL 1.6.0 does not work with external Hive metastore version lower than 
> 1.2.0; its not backwards compatible with earlier version
> --
>
> Key: SPARK-14492
> URL: https://issues.apache.org/jira/browse/SPARK-14492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunil Rangwani
>Priority: Critical
>
> Spark SQL when configured with a Hive version lower than 1.2.0 throws a 
> java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME 
> because this field was introduced in Hive 1.2.0, so it's not possible to use 
> Hive metastore version lower than 1.2.0 with Spark. The details of the Hive 
> changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 
> {code:java}
> Exception in thread "main" java.lang.NoSuchFieldError: 
> METASTORE_CLIENT_SOCKET_LIFETIME
>   at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
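
For anyone hitting this on a newer Spark, a common workaround is to point the Hive client at the older metastore via configuration rather than rebuilding Spark. A sketch (version string and jar path are examples, and which metastore versions are accepted depends on the Spark release):

{code:scala}
// Sketch: keep the built-in Hive support but talk to an older metastore.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("older-metastore-example")
  .config("spark.sql.hive.metastore.version", "1.1.0")              // example version
  .config("spark.sql.hive.metastore.jars", "/opt/hive-1.1.0/lib/*") // example path
  .enableHiveSupport()
  .getOrCreate()
{code}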



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version

2018-11-19 Thread Sunil Rangwani (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691811#comment-16691811
 ] 

Sunil Rangwani commented on SPARK-14492:


[~srowen]

Please reopen this.

It is not about building with varying versions of Hive. After building with the 
required version of Hive, it gives a runtime error - as explained in my earlier 
comment. 

> Spark SQL 1.6.0 does not work with external Hive metastore version lower than 
> 1.2.0; its not backwards compatible with earlier version
> --
>
> Key: SPARK-14492
> URL: https://issues.apache.org/jira/browse/SPARK-14492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunil Rangwani
>Priority: Critical
>
> Spark SQL when configured with a Hive version lower than 1.2.0 throws a 
> java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME 
> because this field was introduced in Hive 1.2.0, so it's not possible to use 
> Hive metastore version lower than 1.2.0 with Spark. The details of the Hive 
> changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 
> {code:java}
> Exception in thread "main" java.lang.NoSuchFieldError: 
> METASTORE_CLIENT_SOCKET_LIFETIME
>   at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25528) data source V2 API refactoring (batch read)

2018-11-19 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-25528:

Summary: data source V2 API refactoring (batch read)  (was: data source V2 
read side API refactoring)

> data source V2 API refactoring (batch read)
> ---
>
> Key: SPARK-25528
> URL: https://issues.apache.org/jira/browse/SPARK-25528
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>
> refactor the read side API according to this abstraction
> {code}
> batch: catalog -> table -> scan
> streaming: catalog -> table -> stream -> scan
> {code}
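
To make the layering concrete, a toy sketch of the shape being aimed at (trait names here are illustrative only; the real interfaces in the data source v2 package differ in names and details):

{code:scala}
// Illustration of catalog -> table -> scan for batch; streaming inserts a stream step.
trait TableCatalog { def loadTable(name: String): Table }
trait Table        { def newScanBuilder(): ScanBuilder }
trait ScanBuilder  { def build(): Scan }           // pushdown (filters, pruning) would happen here
trait Scan         { def toBatch: Batch }          // a streaming source would expose a stream instead
trait Batch        { def planInputPartitions(): Array[AnyRef] }  // placeholder for partition planning
{code}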



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25528) data source V2 API refactoring (batch read)

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691799#comment-16691799
 ] 

Apache Spark commented on SPARK-25528:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/23086

> data source V2 API refactoring (batch read)
> ---
>
> Key: SPARK-25528
> URL: https://issues.apache.org/jira/browse/SPARK-25528
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>
> refactor the read side API according to this abstraction
> {code}
> batch: catalog -> table -> scan
> streaming: catalog -> table -> stream -> scan
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26071) disallow map as map key

2018-11-19 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26071.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23045
[https://github.com/apache/spark/pull/23045]

> disallow map as map key
> ---
>
> Key: SPARK-26071
> URL: https://issues.apache.org/jira/browse/SPARK-26071
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14492) Spark SQL 1.6.0 does not work with external Hive metastore version lower than 1.2.0; its not backwards compatible with earlier version

2018-11-19 Thread Andrew Otto (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691787#comment-16691787
 ] 

Andrew Otto commented on SPARK-14492:
-

I found my issue: we were loading some Hive 1.1.0 jars manually via 
spark.driver.extraClassPath in order to open a Hive JDBC connection directly to 
Hive, instead of using spark.sql(), to work around 
https://issues.apache.org/jira/browse/SPARK-23890. The Hive 1.1.0 classes were 
loaded before the ones included with Spark, and as a result they failed when 
referencing a Hive configuration field that didn't exist in 1.1.0.

 

> Spark SQL 1.6.0 does not work with external Hive metastore version lower than 
> 1.2.0; its not backwards compatible with earlier version
> --
>
> Key: SPARK-14492
> URL: https://issues.apache.org/jira/browse/SPARK-14492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Sunil Rangwani
>Priority: Critical
>
> Spark SQL when configured with a Hive version lower than 1.2.0 throws a 
> java.lang.NoSuchFieldError for the field METASTORE_CLIENT_SOCKET_LIFETIME 
> because this field was introduced in Hive 1.2.0, so it's not possible to use 
> Hive metastore version lower than 1.2.0 with Spark. The details of the Hive 
> changes can be found here: https://issues.apache.org/jira/browse/HIVE-9508 
> {code:java}
> Exception in thread "main" java.lang.NoSuchFieldError: 
> METASTORE_CLIENT_SOCKET_LIFETIME
>   at 
> org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:500)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:250)
>   at 
> org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
>   at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:272)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$4.apply(SQLContext.scala:271)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at org.apache.spark.sql.SQLContext.(SQLContext.scala:271)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:58)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.(SparkSQLCLIDriver.scala:267)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:139)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26112) Update since versions of new built-in functions.

2018-11-19 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26112.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23082
[https://github.com/apache/spark/pull/23082]

> Update since versions of new built-in functions.
> 
>
> Key: SPARK-26112
> URL: https://issues.apache.org/jira/browse/SPARK-26112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.0.0
>
>
> The following 5 functions were removed from branch-2.4:
> - map_entries
> - map_filter
> - transform_values
> - transform_keys
> - map_zip_with
> We should update the since version to 3.0.0.
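
A quick way to eyeball the documented version, assuming a Spark 3.0 build and a SparkSession `spark` (the exact wording of the extended output is not guaranteed here):

{code:scala}
// List the functions whose `since` should now read 3.0.0 and dump their extended descriptions.
Seq("map_entries", "map_filter", "transform_values", "transform_keys", "map_zip_with")
  .foreach { fn =>
    spark.sql(s"DESCRIBE FUNCTION EXTENDED $fn").show(truncate = false)
  }
{code}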



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26024) Dataset API: repartitionByRange(...) has inconsistent behaviour

2018-11-19 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26024:
---

Assignee: Julien Peloton

> Dataset API: repartitionByRange(...) has inconsistent behaviour
> ---
>
> Key: SPARK-26024
> URL: https://issues.apache.org/jira/browse/SPARK-26024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
> Environment: Spark version 2.3.2
>Reporter: Julien Peloton
>Assignee: Julien Peloton
>Priority: Major
>  Labels: dataFrame, partitioning, repartition, spark-sql
> Fix For: 3.0.0
>
>
> Hi,
> I recently played with the {{repartitionByRange}} method for DataFrame 
> introduced in SPARK-22614. For DataFrames larger than the one tested in the 
> code (which has only 10 elements), the code sends back random results.
> As a test for showing the inconsistent behaviour, I start as the unit code 
> used to test {{repartitionByRange}} 
> ([here|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala#L352])
>  but I increase the size of the initial array to 1000, repartition using 3 
> partitions, and count the number of elements per partition:
>  
> {code}
> // Shuffle numbers from 0 to 1000, and make a DataFrame
> val df = Random.shuffle(0.to(1000)).toDF("val")
> // Repartition it using 3 partitions
> // Sum up number of elements in each partition, and collect it.
> // And do it several times
> for (i <- 0 to 9) {
>   var counts = df.repartitionByRange(3, col("val"))
>     .mapPartitions{part => Iterator(part.size)}
>     .collect()
>   println(counts.toList)
> }
> // -> the number of elements in each partition varies...
> {code}
> I do not know whether it is expected (I will dig further in the code), but it 
> sounds like a bug.
>  Or I just misinterpret what {{repartitionByRange}} is for?
>  Any ideas?
> Thanks!
>  Julien
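
An equivalent way to inspect the per-partition sizes, assuming the same `df` as above; `spark_partition_id()` tags each row with the partition it landed in, so stable range partitioning should give stable counts across runs:

{code:scala}
import org.apache.spark.sql.functions.{col, spark_partition_id}

df.repartitionByRange(3, col("val"))
  .groupBy(spark_partition_id().as("pid"))
  .count()
  .orderBy("pid")
  .show()
{code}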



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26024) Dataset API: repartitionByRange(...) has inconsistent behaviour

2018-11-19 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26024.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23025
[https://github.com/apache/spark/pull/23025]

> Dataset API: repartitionByRange(...) has inconsistent behaviour
> ---
>
> Key: SPARK-26024
> URL: https://issues.apache.org/jira/browse/SPARK-26024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
> Environment: Spark version 2.3.2
>Reporter: Julien Peloton
>Priority: Major
>  Labels: dataFrame, partitioning, repartition, spark-sql
> Fix For: 3.0.0
>
>
> Hi,
> I recently played with the {{repartitionByRange}} method for DataFrame 
> introduced in SPARK-22614. For DataFrames larger than the one tested in the 
> code (which has only 10 elements), the code sends back random results.
> As a test for showing the inconsistent behaviour, I start as the unit code 
> used to test {{repartitionByRange}} 
> ([here|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala#L352])
>  but I increase the size of the initial array to 1000, repartition using 3 
> partitions, and count the number of elements per partition:
>  
> {code}
> // Shuffle numbers from 0 to 1000, and make a DataFrame
> val df = Random.shuffle(0.to(1000)).toDF("val")
> // Repartition it using 3 partitions
> // Sum up number of elements in each partition, and collect it.
> // And do it several times
> for (i <- 0 to 9) {
>   var counts = df.repartitionByRange(3, col("val"))
>     .mapPartitions{part => Iterator(part.size)}
>     .collect()
>   println(counts.toList)
> }
> // -> the number of elements in each partition varies...
> {code}
> I do not know whether it is expected (I will dig further in the code), but it 
> sounds like a bug.
>  Or I just misinterpret what {{repartitionByRange}} is for?
>  Any ideas?
> Thanks!
>  Julien



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26112) Update since versions of new built-in functions.

2018-11-19 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26112:
---

Assignee: Takuya Ueshin

> Update since versions of new built-in functions.
> 
>
> Key: SPARK-26112
> URL: https://issues.apache.org/jira/browse/SPARK-26112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.0.0
>
>
> The following 5 functions were removed from branch-2.4:
> - map_entries
> - map_filter
> - transform_values
> - transform_keys
> - map_zip_with
> We should update their {{since}} versions to 3.0.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26068) ChunkedByteBufferInputStream is truncated by empty chunk

2018-11-19 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26068:
---

Assignee: Liu, Linhong

> ChunkedByteBufferInputStream is truncated by empty chunk
> 
>
> Key: SPARK-26068
> URL: https://issues.apache.org/jira/browse/SPARK-26068
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liu, Linhong
>Assignee: Liu, Linhong
>Priority: Major
> Fix For: 3.0.0
>
>
> If a ChunkedByteBuffer contains an empty chunk in the middle, then the 
> ChunkedByteBufferInputStream is truncated: none of the data behind the empty 
> chunk is read.
> The problematic code:
> {code:java}
> // ChunkedByteBuffer.scala
> // If chunks.next() returns an empty chunk, we fall through to the else branch
> // regardless of whether chunks.hasNext is true, so the remaining data is lost.
> override def read(dest: Array[Byte], offset: Int, length: Int): Int = {
>   if (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext) {
>     currentChunk = chunks.next()
>   }
>   if (currentChunk != null && currentChunk.hasRemaining) {
>     val amountToGet = math.min(currentChunk.remaining(), length)
>     currentChunk.get(dest, offset, amountToGet)
>     amountToGet
>   } else {
>     close()
>     -1
>   }
> } {code}
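
A minimal sketch of the kind of change this points at (the eventual fix may 
differ): advance past every exhausted chunk, not just one, before deciding 
whether any data remains.

{code:scala}
// Sketch only: a loop instead of a single if, so that a run of empty chunks in
// the middle of the buffer cannot make read() return -1 early.
override def read(dest: Array[Byte], offset: Int, length: Int): Int = {
  while (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext) {
    currentChunk = chunks.next()
  }
  if (currentChunk != null && currentChunk.hasRemaining) {
    val amountToGet = math.min(currentChunk.remaining(), length)
    currentChunk.get(dest, offset, amountToGet)
    amountToGet
  } else {
    close()
    -1
  }
}
{code}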



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26068) ChunkedByteBufferInputStream is truncated by empty chunk

2018-11-19 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26068.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23040
[https://github.com/apache/spark/pull/23040]

> ChunkedByteBufferInputStream is truncated by empty chunk
> 
>
> Key: SPARK-26068
> URL: https://issues.apache.org/jira/browse/SPARK-26068
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liu, Linhong
>Priority: Major
> Fix For: 3.0.0
>
>
> If a ChunkedByteBuffer contains an empty chunk in the middle, then the 
> ChunkedByteBufferInputStream is truncated: none of the data behind the empty 
> chunk is read.
> The problematic code:
> {code:java}
> // ChunkedByteBuffer.scala
> // If chunks.next() returns an empty chunk, we fall through to the else branch
> // regardless of whether chunks.hasNext is true, so the remaining data is lost.
> override def read(dest: Array[Byte], offset: Int, length: Int): Int = {
>   if (currentChunk != null && !currentChunk.hasRemaining && chunks.hasNext) {
>     currentChunk = chunks.next()
>   }
>   if (currentChunk != null && currentChunk.hasRemaining) {
>     val amountToGet = math.min(currentChunk.remaining(), length)
>     currentChunk.get(dest, offset, amountToGet)
>     amountToGet
>   } else {
>     close()
>     -1
>   }
> } {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception

2018-11-19 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16691725#comment-16691725
 ] 

Apache Spark commented on SPARK-26117:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/23084

> use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception
> --
>
> Key: SPARK-26117
> URL: https://issues.apache.org/jira/browse/SPARK-26117
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.5.0
>Reporter: caoxuewen
>Priority: Major
>
> PR #20014 introduced SparkOutOfMemoryError to avoid killing the entire 
> executor when an OutOfMemoryError is thrown.
> So where memory is requested via MemoryConsumer.allocatePage and the failure 
> is caught, catch SparkOutOfMemoryError instead of OutOfMemoryError.
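
A hedged sketch of the intent (hypothetical helper names, not the actual patch; 
it assumes SparkOutOfMemoryError lives under org.apache.spark.memory): catch the 
narrower Spark-specific error around an allocation so that a genuine JVM heap 
exhaustion still propagates and fails the executor as before.

{code:scala}
import org.apache.spark.memory.SparkOutOfMemoryError

// Sketch: handle the allocation failure reported by the task memory manager,
// while a real java.lang.OutOfMemoryError is left to propagate.
def allocateOrSpill(allocate: () => Long, spill: () => Unit): Long =
  try allocate()
  catch {
    case _: SparkOutOfMemoryError =>
      spill()    // hypothetical recovery: free some task memory, then retry once
      allocate()
  }
{code}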



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26043) Make SparkHadoopUtil private to Spark

2018-11-19 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-26043:
-

Assignee: Sean Owen

> Make SparkHadoopUtil private to Spark
> -
>
> Key: SPARK-26043
> URL: https://issues.apache.org/jira/browse/SPARK-26043
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Sean Owen
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> This API contains a few small helper methods used internally by Spark, mostly 
> related to Hadoop configs and Kerberos.
> It has historically been marked as "DeveloperApi", but in reality it's not very 
> useful to others and changes too often to be considered a stable API. Better to 
> just make it private to Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26043) Make SparkHadoopUtil private to Spark

2018-11-19 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26043.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23066
[https://github.com/apache/spark/pull/23066]

> Make SparkHadoopUtil private to Spark
> -
>
> Key: SPARK-26043
> URL: https://issues.apache.org/jira/browse/SPARK-26043
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Sean Owen
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> This API contains a few small helper methods used internally by Spark, mostly 
> related to Hadoop configs and Kerberos.
> It has historically been marked as "DeveloperApi", but in reality it's not very 
> useful to others and changes too often to be considered a stable API. Better to 
> just make it private to Spark.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26026) Published Scaladoc jars missing from Maven Central

2018-11-19 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26026.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23069
[https://github.com/apache/spark/pull/23069]

> Published Scaladoc jars missing from Maven Central
> --
>
> Key: SPARK-26026
> URL: https://issues.apache.org/jira/browse/SPARK-26026
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Long Cao
>Assignee: Sean Owen
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> For 2.3.x and beyond, it appears that published *-javadoc.jars are missing.
> For concrete examples:
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
> After some searching, I'm venturing a guess that [this 
> commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033]
>  removed packaging Scaladoc with the rest of the distribution.
> I don't think it's a huge problem since the versioned Scaladocs are hosted on 
> apache.org, but I use an external documentation/search tool 
> ([Dash|https://kapeli.com/dash]) that operates by looking up published 
> javadoc jars and it'd be nice to have these available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26117:


Assignee: (was: Apache Spark)

> use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception
> --
>
> Key: SPARK-26117
> URL: https://issues.apache.org/jira/browse/SPARK-26117
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.5.0
>Reporter: caoxuewen
>Priority: Major
>
> PR #20014 introduced SparkOutOfMemoryError to avoid killing the entire 
> executor when an OutOfMemoryError is thrown.
> So where memory is requested via MemoryConsumer.allocatePage and the failure 
> is caught, catch SparkOutOfMemoryError instead of OutOfMemoryError.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26117:


Assignee: Apache Spark

> use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception
> --
>
> Key: SPARK-26117
> URL: https://issues.apache.org/jira/browse/SPARK-26117
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.5.0
>Reporter: caoxuewen
>Assignee: Apache Spark
>Priority: Major
>
> PR #20014 introduced SparkOutOfMemoryError to avoid killing the entire 
> executor when an OutOfMemoryError is thrown.
> So where memory is requested via MemoryConsumer.allocatePage and the failure 
> is caught, catch SparkOutOfMemoryError instead of OutOfMemoryError.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26117) use SparkOutOfMemoryError instead of OutOfMemoryError when catch exception

2018-11-19 Thread caoxuewen (JIRA)
caoxuewen created SPARK-26117:
-

 Summary: use SparkOutOfMemoryError instead of OutOfMemoryError 
when catch exception
 Key: SPARK-26117
 URL: https://issues.apache.org/jira/browse/SPARK-26117
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.5.0
Reporter: caoxuewen


PR #20014 introduced SparkOutOfMemoryError to avoid killing the entire 
executor when an OutOfMemoryError is thrown.
So where memory is requested via MemoryConsumer.allocatePage and the failure 
is caught, catch SparkOutOfMemoryError instead of OutOfMemoryError.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions

2018-11-19 Thread Sergey Zhemzhitsky (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Zhemzhitsky updated SPARK-26114:
---
Description: 
Trying to use _coalesce_ after shuffle-oriented transformations leads to 
OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X 
GB of Y GB physical memory used. Consider 
boosting spark.yarn.executor.memoryOverhead_.
Discussion is 
[here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html].

The error happens when trying to specify a pretty small number of partitions 
in the _coalesce_ call.

*How to reproduce?*

# Start spark-shell
{code:bash}
spark-shell \ 
  --num-executors=5 \ 
  --executor-cores=2 \ 
  --master=yarn \
  --deploy-mode=client \ 
  --conf spark.executor.memoryOverhead=512 \
  --conf spark.executor.memory=1g \ 
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
{code}
Please note the use of the _-Dio.netty.noUnsafe=true_ property. For now, 
preventing off-heap memory usage seems to be the only way to control the 
amount of memory used for transferring shuffle data.
Also note that the total number of cores allocated for the job is 5x2=10
# Then generate some test data
{code:scala}
import org.apache.hadoop.io._ 
import org.apache.hadoop.io.compress._ 
import org.apache.commons.lang._ 
import org.apache.spark._ 

// generate 100M records of sample data 
sc.makeRDD(1 to 1000, 1000)
  .flatMap(item => (1 to 10)
    .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) ->
      new Text(RandomStringUtils.randomAlphanumeric(1024))))
  .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec]))
{code}
# Run the sample job
{code:scala}
import org.apache.hadoop.io._
import org.apache.spark._
import org.apache.spark.storage._

val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
rdd 
  .map(item => item._1.toString -> item._2.toString) 
  .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) 
  .coalesce(10,false) 
  .count 
{code}
Note that the number of partitions is equal to the total number of cores 
allocated to the job.

Here is dominator tree from the heapdump
 !run1-noparams-dominator-tree.png|width=700!

4 instances of ExternalSorter, although there are only 2 concurrently running 
tasks per executor.
 !run1-noparams-dominator-tree-externalsorter.png|width=700! 

And paths to GC root of the already stopped ExternalSorter.
 !run1-noparams-dominator-tree-externalsorter-gc-root.png|width=700! 
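
For the record, a hedged sketch of one way to avoid folding 100 sorted shuffle 
partitions into each of the 10 coalesced tasks (an assumption about the job's 
intent, not a fix for the leak itself): ask the shuffle for the final partition 
count instead of narrowing it afterwards.

{code:scala}
import org.apache.hadoop.io._
import org.apache.spark._

// Sketch: with 10 shuffle partitions there is one ExternalSorter per task,
// instead of one per folded upstream partition after a narrow coalesce.
val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
rdd
  .map(item => item._1.toString -> item._2.toString)
  .repartitionAndSortWithinPartitions(new HashPartitioner(10))
  .count
{code}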

  was:
Trying to use _coalesce_ after shuffle-oriented transformations leads to 
OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X 
GB of Y GB physical memory used. Consider 
boostingspark.yarn.executor.memoryOverhead_.
Discussion is 
[here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html].

The error happens when trying specify pretty small number of partitions in 
_coalesce_ call.

*How to reproduce?*

# Start spark-shell
{code:bash}
spark-shell \ 
  --num-executors=5 \ 
  --executor-cores=2 \ 
  --master=yarn \
  --deploy-mode=client \ 
  --conf spark.executor.memory=1g \ 
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError 
-XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
{code}
Please note using _-Dio.netty.noUnsafe=true_ property. Preventing off-heap 
memory usage seems to be the only way to control the amount of memory used for 
shuffle data transferring by now.
Also note that the total number of cores allocated for job is 5x2=10
# Then generate some test data
{code:scala}
import org.apache.hadoop.io._ 
import org.apache.hadoop.io.compress._ 
import org.apache.commons.lang._ 
import org.apache.spark._ 

// generate 100M records of sample data 
sc.makeRDD(1 to 1000, 1000) 
  .flatMap(item => (1 to 10) 
.map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) -> 
new Text(RandomStringUtils.randomAlphanumeric(1024 
  .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) 
{code}
# Run the sample job
{code:scala}
import org.apache.hadoop.io._
import org.apache.spark._
import org.apache.spark.storage._

val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
rdd 
  .map(item => item._1.toString -> item._2.toString) 
  .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) 
  .coalesce(10,false) 
  .count 
{code}
Note that the number of partitions is equal to the total number of cores 
allocated to the job.

Here is dominator tree from the heapdump
 !run1-noparams-dominator-tree.png|width=700!

4 instances of ExternalSorter, although there are only 2 concurrently running 
tasks per executor.
 !run1-noparams-dominator-tree-externalsorter.png|width=700! 

[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet leads to OOM errors

2018-11-19 Thread Pierre Lienhart (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Lienhart updated SPARK-26116:

Description: 
When writing partitioned parquet using {{partitionBy}}, it looks like Spark 
sorts each partition before writing, but this sort consumes a huge amount of 
memory compared to the size of the data. The executors can then go OOM and get 
killed by YARN. As a consequence, it also forces provisioning huge amounts of 
memory compared to the data to be written.

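A workaround sometimes applied to this kind of job (a sketch under assumptions 
about the data, with hypothetical column and path names, not something verified 
in this ticket) is to cluster and shrink each write task before {{partitionBy}}, 
so the per-task sort over the partition columns has less data to hold or spill:

{code:scala}
import org.apache.spark.sql.functions.col

// Sketch: more, smaller tasks clustered by the (hypothetical) partition column
// "event_date"; 400 is an arbitrary illustrative partition count.
df.repartition(400, col("event_date"))
  .write
  .partitionBy("event_date")
  .parquet("/tmp/output")   // hypothetical output path
{code}
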
Error messages found in the Spark UI are like the following:

{code:java}
Spark UI description of failure : Job aborted due to stage failure: Task 169 in 
stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 
(TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 
1 exited caused by one of the running tasks) Reason: Container killed by YARN 
for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider 
boosting spark.yarn.executor.memoryOverhead.
{code}
 
{code:java}
Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
executor 1): org.apache.spark.SparkException: Task failed while writing rows

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)

 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

 at org.apache.spark.scheduler.Task.run(Task.scala:99)

 at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)

 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

 at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
/app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
 (No such file or directory)

 at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)

 at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)

 at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)

 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)

 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)

 ... 8 more{code}
 
In the stderr logs, we can see that a huge amount of sort data (the partition 
being sorted here is 250 MB when persisted into memory, deserialized) is being 
spilled to disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data 
of 3.6 GB to disk}}). Sometimes the data is spilled to disk in time and the 
sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out 
partition files one at a time.}}), but sometimes it is not and we see repeated 
{{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} 
messages until the application finally runs OOM with logs such as {{ERROR 
UnsafeExternalSorter: Unable to grow the pointer array}}.

I should mention that when looking at individual (successful) write tasks in 
the Spark UI, the Peak Execution Memory metric is always 0.

It looks like a known issue: SPARK-12546 is explicitly related and 

[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet leads to OOM errors

2018-11-19 Thread Pierre Lienhart (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Lienhart updated SPARK-26116:

Description: 
When writing partitioned parquet using {{partitionBy}}, it looks like Spark 
sorts each partition before writing, but this sort consumes a huge amount of 
memory compared to the size of the data. The executors can then go OOM and get 
killed by YARN. As a consequence, it also forces provisioning a huge amount of 
memory compared to the data to be written.

Error messages found in the Spark UI are like the following:

{code:java}
Spark UI description of failure : Job aborted due to stage failure: Task 169 in 
stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 
(TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 
1 exited caused by one of the running tasks) Reason: Container killed by YARN 
for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider 
boosting spark.yarn.executor.memoryOverhead.
{code}
 
{code:java}
Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
executor 1): org.apache.spark.SparkException: Task failed while writing rows

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)

 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

 at org.apache.spark.scheduler.Task.run(Task.scala:99)

 at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)

 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

 at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
/app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
 (No such file or directory)

 at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)

 at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)

 at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)

 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)

 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)

 ... 8 more{code}
 
In the stderr logs, we can see that a huge amount of sort data (the partition 
being sorted here is 250 MB when persisted into memory, deserialized) is being 
spilled to disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data 
of 3.6 GB to disk}}). Sometimes the data is spilled to disk in time and the 
sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out 
partition files one at a time.}}), but sometimes it is not and we see repeated 
{{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} 
messages until the application finally runs OOM with logs such as {{ERROR 
UnsafeExternalSorter: Unable to grow the pointer array}}.

I should mention that when looking at individual (successful) write tasks in 
the Spark UI, the Peak Execution Memory metric is always 0.

It looks like a known issue: SPARK-12546 is explicitly related and 

[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors

2018-11-19 Thread Pierre Lienhart (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Lienhart updated SPARK-26116:

Description: 
When writing partitioned parquet using {{partitionBy}}, it looks like Spark 
sorts each partition before writing, but this sort consumes a huge amount of 
memory compared to the size of the data. The executors can then go OOM and get 
killed by YARN. As a consequence, it also forces provisioning a huge amount of 
memory compared to the data to be written.

Error messages found in the Spark UI are like the following:

 
{code:java}
Spark UI description of failure : Job aborted due to stage failure: Task 169 in 
stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 
(TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 
1 exited caused by one of the running tasks) Reason: Container killed by YARN 
for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider 
boosting spark.yarn.executor.memoryOverhead.
{code}
 
{code:java}
Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
executor 1): org.apache.spark.SparkException: Task failed while writing rows

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)

 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

 at org.apache.spark.scheduler.Task.run(Task.scala:99)

 at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)

 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

 at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
/app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
 (No such file or directory)

 at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)

 at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)

 at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)

 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)

 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)

 ... 8 more{code}
 

 

In the stderr logs, we can see that a huge amount of sort data (the partition 
being sorted here is 250 MB when persisted into memory, deserialized) is being 
spilled to disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data 
of 3.6 GB to disk}}). Sometimes the data is spilled to disk in time and the 
sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out 
partition files one at a time.}}), but sometimes it is not and we see repeated 
{{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} 
messages until the application finally runs OOM with logs such as {{ERROR 
UnsafeExternalSorter: Unable to grow the pointer array}}.

I should mention that when looking at individual (successful) write tasks in 
the Spark UI, the Peak Execution Memory metric is always 0.

It looks like a known issue: SPARK-12546 is explicitly related 

[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet leads to OOM errors

2018-11-19 Thread Pierre Lienhart (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Lienhart updated SPARK-26116:

Summary: Spark SQL - Sort when writing partitioned parquet leads to OOM 
errors  (was: Spark SQL - Sort when writing partitioned parquet lead to OOM 
errors)

> Spark SQL - Sort when writing partitioned parquet leads to OOM errors
> -
>
> Key: SPARK-26116
> URL: https://issues.apache.org/jira/browse/SPARK-26116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Pierre Lienhart
>Priority: Major
>
> When writing partitioned parquet using {{partitionBy}}, it looks like Spark 
> sorts each partition before writing, but this sort consumes a huge amount of 
> memory compared to the size of the data. The executors can then go OOM and 
> get killed by YARN. As a consequence, it also forces provisioning a huge 
> amount of memory compared to the data to be written.
> Error messages found in the Spark UI are like the following:
> {code:java}
> Spark UI description of failure : Job aborted due to stage failure: Task 169 
> in stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 
> 2.0 (TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure 
> (executor 1 exited caused by one of the running tasks) Reason: Container 
> killed by YARN for exceeding memory limits. 8.1 GB of 8 GB physical memory 
> used. Consider boosting spark.yarn.executor.memoryOverhead.
> {code}
>  
> {code:java}
> Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
> recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
> executor 1): org.apache.spark.SparkException: Task failed while writing rows
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>  at org.apache.spark.scheduler.Task.run(Task.scala:99)
>  at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
> /app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
>  (No such file or directory)
>  at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)
>  at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)
>  at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)
>  at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)
>  at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)
>  at 
> org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
>  ... 8 more{code}
>  
> In the stderr logs, we can see that huge amount of sort data (the partition 
> being sorted here is 250 MB when persisted into memory, deserialized) is 
> being spilled to the disk ({{INFO 

[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors

2018-11-19 Thread Pierre Lienhart (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Lienhart updated SPARK-26116:

Description: 
When writing partitioned parquet using {{partitionBy}}, it looks like Spark 
sorts each partition before writing, but this sort consumes a huge amount of 
memory compared to the size of the data. The executors can then go OOM and get 
killed by YARN. As a consequence, it also forces provisioning a huge amount of 
memory compared to the data to be written.

Error messages found in the Spark UI are like the following:

{code:java}
Spark UI description of failure : Job aborted due to stage failure: Task 169 in 
stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 
(TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 
1 exited caused by one of the running tasks) Reason: Container killed by YARN 
for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider 
boosting spark.yarn.executor.memoryOverhead.
{code}
 
{code:java}
Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
executor 1): org.apache.spark.SparkException: Task failed while writing rows

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)

 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

 at org.apache.spark.scheduler.Task.run(Task.scala:99)

 at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)

 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

 at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
/app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
 (No such file or directory)

 at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)

 at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)

 at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)

 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)

 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)

 ... 8 more{code}
 
In the stderr logs, we can see that a huge amount of sort data (the partition 
being sorted here is 250 MB when persisted into memory, deserialized) is being 
spilled to disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data 
of 3.6 GB to disk}}). Sometimes the data is spilled to disk in time and the 
sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out 
partition files one at a time.}}), but sometimes it is not and we see repeated 
{{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} 
messages until the application finally runs OOM with logs such as {{ERROR 
UnsafeExternalSorter: Unable to grow the pointer array}}.

I should mention that when looking at individual (successful) write tasks in 
the Spark UI, the Peak Execution Memory metric is always 0.

It looks like a known issue: SPARK-12546 is explicitly related and 

[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors

2018-11-19 Thread Pierre Lienhart (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Lienhart updated SPARK-26116:

Description: 
When writing partitioned parquet using {{partitionBy}}, it looks like Spark 
sorts each partition before writing, but this sort consumes a huge amount of 
memory compared to the size of the data. The executors can then go OOM and get 
killed by YARN. As a consequence, it also forces provisioning a huge amount of 
memory compared to the data to be written.

Error messages found in the Spark UI are like the following:

 
{code:java}
Spark UI description of failure : Job aborted due to stage failure: Task 169 in 
stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 
(TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 
1 exited caused by one of the running tasks) Reason: Container killed by YARN 
for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider 
boosting spark.yarn.executor.memoryOverhead.
{code}
 
{code:java}
Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
executor 1): org.apache.spark.SparkException: Task failed while writing rows

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)

 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

 at org.apache.spark.scheduler.Task.run(Task.scala:99)

 at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)

 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

 at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
/app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
 (No such file or directory)

 at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)

 at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)

 at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)

 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)

 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)

 ... 8 more{code}
 

 

In the stderr logs, we can see that a huge amount of sort data (the partition 
being sorted here is 250 MB when persisted into memory, deserialized) is being 
spilled to disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data 
of 3.6 GB to disk}}). Sometimes the data is spilled to disk in time and the 
sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out 
partition files one at a time.}}), but sometimes it is not and we see repeated 
{{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} 
messages until the application finally runs OOM with logs such as {{ERROR 
UnsafeExternalSorter: Unable to grow the pointer array}}.

I should mention that when looking at individual (successful) write tasks in 
the Spark UI, the Peak Execution Memory metric is always 0.

It looks like a known issue: SPARK-12546 is explicitly related 

[jira] [Updated] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors

2018-11-19 Thread Pierre Lienhart (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pierre Lienhart updated SPARK-26116:

Description: 
When writing partitioned parquet using {{partitionBy}}, it looks like Spark 
sorts each partition before writing, but this sort consumes a huge amount of 
memory compared to the size of the data. The executors can then go OOM and get 
killed by YARN. As a consequence, it also forces provisioning a huge amount of 
memory relative to the amount of data actually being written.

Error messages found in the Spark UI look like the following:

 
{code:java}
Spark UI description of failure : Job aborted due to stage failure: Task 169 in 
stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 
(TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 
1 exited caused by one of the running tasks) Reason: Container killed by YARN 
for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider 
boosting spark.yarn.executor.memoryOverhead.
{code}
 
{code:java}
Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
executor 1): org.apache.spark.SparkException: Task failed while writing rows

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)

 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

 at org.apache.spark.scheduler.Task.run(Task.scala:99)

 at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)

 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

 at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
/app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
 (No such file or directory)

 at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)

 at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)

 at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)

 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)

 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)

 ... 8 more{code}
 

 

In the stderr logs, we can see that a huge amount of sort data (the partition 
being sorted here is 250 MB when persisted into memory, deserialized) is being 
spilled to disk ({{INFO UnsafeExternalSorter: Thread 155 spilling sort data 
of 3.6 GB to disk}}). Sometimes the data is spilled to disk in time and the 
sort completes ({{INFO FileFormatWriter: Sorting complete. Writing out 
partition files one at a time.}}), but sometimes it is not and we see repeated 
{{TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.}} 
messages until the application finally runs OOM with logs such as {{ERROR 
UnsafeExternalSorter: Unable to grow the pointer array}}.

I should mention that when looking at individual (successful) write tasks in 
the Spark UI, the Peak Execution Memory metric is always 0.

This looks like a known issue: SPARK-12546 is explicitly related.
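
For context, here is a minimal sketch of the kind of partitioned write described 
above. This is an illustrative example, not code from the report: the input path, 
DataFrame, partition column, and memoryOverhead value are all hypothetical, and the 
config key is only the mitigation hinted at by the YARN error quoted earlier.

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical session; bumping the YARN overhead is the mitigation the error
// message itself suggests (Spark 2.x property name).
val spark = SparkSession.builder()
  .appName("partitioned-parquet-write")
  .config("spark.yarn.executor.memoryOverhead", "2048")
  .getOrCreate()

// Hypothetical input and partition column. Before writing each task's files,
// Spark sorts the rows by the partition column, which is the sort discussed above.
val df = spark.read.parquet("/data/input")
df.write
  .partitionBy("event_date")
  .parquet("/data/output_partitioned")
{code}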

[jira] [Created] (SPARK-26116) Spark SQL - Sort when writing partitioned parquet lead to OOM errors

2018-11-19 Thread Pierre Lienhart (JIRA)
Pierre Lienhart created SPARK-26116:
---

 Summary: Spark SQL - Sort when writing partitioned parquet lead to 
OOM errors
 Key: SPARK-26116
 URL: https://issues.apache.org/jira/browse/SPARK-26116
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1
Reporter: Pierre Lienhart


When writing partitioned parquet using ```partitionBy```, it looks like Spark 
sorts each partition before writing, but this sort consumes a huge amount of 
memory compared to the size of the data. The executors can then go OOM and get 
killed by YARN. As a consequence, it also forces provisioning a huge amount of 
memory relative to the amount of data actually being written.

Error messages found in the Spark UI look like the following:

```

Spark UI description of failure : Job aborted due to stage failure: Task 169 in 
stage 2.0 failed 1 times, most recent failure: Lost task 169.0 in stage 2.0 
(TID 98, x.xx.x.xx, executor 1): ExecutorLostFailure (executor 
1 exited caused by one of the running tasks) Reason: Container killed by YARN 
for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider 
boosting spark.yarn.executor.memoryOverhead.

```

```

Job aborted due to stage failure: Task 66 in stage 4.0 failed 1 times, most 
recent failure: Lost task 66.0 in stage 4.0 (TID 56, xxx.x.x.xx, 
executor 1): org.apache.spark.SparkException: Task failed while writing rows

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:204)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)

 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)

 at org.apache.spark.scheduler.Task.run(Task.scala:99)

 at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)

 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

 at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@75194804 : 
/app/hadoop/yarn/local/usercache/at053351/appcache/application_1537536072724_17039/blockmgr-a4ba7d59-e780-4385-99b4-a4c4fe95a1ec/25/temp_local_a542a412-5845-45d2-9302-bbf5ee4113ad
 (No such file or directory)

 at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:188)

 at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:254)

 at 
org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:92)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:347)

 at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:425)

 at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:160)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:364)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)

 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)

 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)

 ... 8 more

 ```

In the stderr logs, we can see that a huge amount of sort data (the partition 
being sorted here is 250 MB when persisted into memory, deserialized) is being 
spilled to disk (```INFO UnsafeExternalSorter: Thread 155 spilling sort 
data of 3.6 GB to disk```). Sometimes the data is spilled to disk in time 
and the sort completes (```INFO FileFormatWriter: Sorting complete. Writing out 
partition files one at a time.```), but sometimes it is not and we see 
repeated ```TaskMemoryManager: Failed to allocate a page (67108864 bytes), try 
again.``` messages until the application finally runs OOM with logs such as ```ERROR 
UnsafeExternalSorter: Unable to grow the pointer array```.

I should mention 

[jira] [Resolved] (SPARK-26115) Fix deprecated warnings related to scala 2.12 in core module

2018-11-19 Thread Gengliang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-26115.

Resolution: Duplicate

Duplicate of SPARK-2609

> Fix deprecated warnings related to scala 2.12 in core module
> 
>
> Key: SPARK-26115
> URL: https://issues.apache.org/jira/browse/SPARK-26115
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> {quote}[warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:505:
>  Eta-expansion of zero-argument methods is deprecated. To avoid this warning, 
> write (() => SparkContext.this.reportHeartBeat()).
> [warn] _heartbeater = new Heartbeater(env.memoryManager, reportHeartBeat, 
> "driver-heartbeater",
> [warn]   ^
> [warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala:273:
>  Eta-expansion of zero-argument methods is deprecated. To avoid this warning, 
> write (() => HadoopDelegationTokenManager.this.fileSystemsToAccess()).
> [warn] val providers = Seq(new 
> HadoopFSDelegationTokenProvider(fileSystemsToAccess)) ++
> [warn] ^
> [warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/executor/Executor.scala:193:
>  Eta-expansion of zero-argument methods is deprecated. To avoid this warning, 
> write (() => Executor.this.reportHeartBeat()).
> [warn]   private val heartbeater = new Heartbeater(env.memoryManager, 
> reportHeartBeat,
> [warn]^
> [warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/api/r/RBackend.scala:102:
>  method childGroup in class ServerBootstrap is deprecated: see corresponding 
> Javadoc for more information.
> [warn] if (bootstrap != null && bootstrap.childGroup() != null) {
> [warn]^
> [warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala:63:
>  type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use 
> java.util.concurrent.ForkJoinPool directly, instead of this alias
> [warn] new ForkJoinTaskSupport(new ForkJoinPool(8))
> [warn] ^
> [warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/scheduler/StageInfo.scala:59:
>  value attemptId in class StageInfo is deprecated (since 2.3.0): Use 
> attemptNumber instead
> [warn]   def attemptNumber(): Int = attemptId
> [warn]  ^
> [warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:186:
>  type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use 
> java.util.concurrent.ForkJoinPool directly, instead of this alias
> [warn]   def newForkJoinPool(prefix: String, maxThreadNumber: Int): 
> SForkJoinPool = {
> [warn]  ^
> [warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:189:
>  type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use 
> java.util.concurrent.ForkJoinPool directly, instead of this alias
> [warn]   override def newThread(pool: SForkJoinPool) =
> [warn]^
> [warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:190:
>  type ForkJoinWorkerThread in package forkjoin is deprecated (since 2.12.0): 
> use java.util.concurrent.ForkJoinWorkerThread directly, instead of this alias
> [warn] new SForkJoinWorkerThread(pool) {
> [warn] ^
> [warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:194:
>  type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use 
> java.util.concurrent.ForkJoinPool directly, instead of this alias
> [warn] new SForkJoinPool(maxThreadNumber, factory,
> [warn] ^
> [warn] 10 warnings found
> [warn] 
> /Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:505:
>  Eta-expansion of zero-argument methods is deprecated. To avoid this warning, 
> write (() => SparkContext.this.reportHeartBeat()).
> [warn] _heartbeater = new Heartbeater(env.memoryManager, reportHeartBeat, 
> "driver-heartbeater",
> [warn] 
> [warn] 
> 
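
A small sketch of the two fixes these warnings point at. This is illustrative only: 
the enclosing class and the Heartbeater signature below are simplified, hypothetical 
stand-ins for the Spark code cited in the warnings.

{code:scala}
// Use java.util.concurrent.ForkJoinPool instead of the deprecated
// scala.concurrent.forkjoin alias.
import java.util.concurrent.ForkJoinPool

class Heartbeater(task: () => Unit)   // simplified stand-in

class Driver {
  private def reportHeartBeat(): Unit = ()

  // Deprecated in Scala 2.12: relying on eta-expansion of a zero-argument method.
  // val heartbeater = new Heartbeater(reportHeartBeat)

  // Warning-free form, exactly as the compiler message suggests:
  val heartbeater = new Heartbeater(() => reportHeartBeat())

  // ForkJoinPool taken from java.util.concurrent rather than the deprecated alias.
  val pool = new ForkJoinPool(8)
}
{code}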

[jira] [Created] (SPARK-26115) Fix deprecated warnings related to scala 2.12 in core module

2018-11-19 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-26115:
--

 Summary: Fix deprecated warnings related to scala 2.12 in core 
module
 Key: SPARK-26115
 URL: https://issues.apache.org/jira/browse/SPARK-26115
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Gengliang Wang


{quote}[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:505:
 Eta-expansion of zero-argument methods is deprecated. To avoid this warning, 
write (() => SparkContext.this.reportHeartBeat()).
[warn] _heartbeater = new Heartbeater(env.memoryManager, reportHeartBeat, 
"driver-heartbeater",
[warn]   ^
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala:273:
 Eta-expansion of zero-argument methods is deprecated. To avoid this warning, 
write (() => HadoopDelegationTokenManager.this.fileSystemsToAccess()).
[warn] val providers = Seq(new 
HadoopFSDelegationTokenProvider(fileSystemsToAccess)) ++
[warn] ^
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/executor/Executor.scala:193:
 Eta-expansion of zero-argument methods is deprecated. To avoid this warning, 
write (() => Executor.this.reportHeartBeat()).
[warn]   private val heartbeater = new Heartbeater(env.memoryManager, 
reportHeartBeat,
[warn]^
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/api/r/RBackend.scala:102:
 method childGroup in class ServerBootstrap is deprecated: see corresponding 
Javadoc for more information.
[warn] if (bootstrap != null && bootstrap.childGroup() != null) {
[warn]^
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala:63:
 type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use 
java.util.concurrent.ForkJoinPool directly, instead of this alias
[warn] new ForkJoinTaskSupport(new ForkJoinPool(8))
[warn] ^
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/scheduler/StageInfo.scala:59:
 value attemptId in class StageInfo is deprecated (since 2.3.0): Use 
attemptNumber instead
[warn]   def attemptNumber(): Int = attemptId
[warn]  ^
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:186:
 type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use 
java.util.concurrent.ForkJoinPool directly, instead of this alias
[warn]   def newForkJoinPool(prefix: String, maxThreadNumber: Int): 
SForkJoinPool = {
[warn]  ^
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:189:
 type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use 
java.util.concurrent.ForkJoinPool directly, instead of this alias
[warn]   override def newThread(pool: SForkJoinPool) =
[warn]^
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:190:
 type ForkJoinWorkerThread in package forkjoin is deprecated (since 2.12.0): 
use java.util.concurrent.ForkJoinWorkerThread directly, instead of this alias
[warn] new SForkJoinWorkerThread(pool) {
[warn] ^
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/util/ThreadUtils.scala:194:
 type ForkJoinPool in package forkjoin is deprecated (since 2.12.0): use 
java.util.concurrent.ForkJoinPool directly, instead of this alias
[warn] new SForkJoinPool(maxThreadNumber, factory,
[warn] ^
[warn] 10 warnings found
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:505:
 Eta-expansion of zero-argument methods is deprecated. To avoid this warning, 
write (() => SparkContext.this.reportHeartBeat()).
[warn] _heartbeater = new Heartbeater(env.memoryManager, reportHeartBeat, 
"driver-heartbeater",
[warn] 
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/deploy/security/HadoopDelegationTokenManager.scala:273:
 Eta-expansion of zero-argument methods is deprecated. To avoid this warning, 
write (() => HadoopDelegationTokenManager.this.fileSystemsToAccess()).
[warn] val providers = Seq(new 
HadoopFSDelegationTokenProvider(fileSystemsToAccess)) ++
[warn] 
[warn] 
/Users/gengliangwang/IdeaProjects/spark/core/src/main/scala/org/apache/spark/api/r/RBackend.scala:102:
 method childGroup in class 

[jira] [Assigned] (SPARK-26114) Memory leak of PartitionedPairBuffer when coalescing after repartitionAndSortWithinPartitions

2018-11-19 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26114:


Assignee: (was: Apache Spark)

> Memory leak of PartitionedPairBuffer when coalescing after 
> repartitionAndSortWithinPartitions
> -
>
> Key: SPARK-26114
> URL: https://issues.apache.org/jira/browse/SPARK-26114
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
> Environment: Spark 3.0.0-SNAPSHOT (master branch)
> Scala 2.11
> Yarn 2.7
>Reporter: Sergey Zhemzhitsky
>Priority: Major
> Attachments: run1-noparams-dominator-tree-externalsorter-gc-root.png, 
> run1-noparams-dominator-tree-externalsorter.png, 
> run1-noparams-dominator-tree.png
>
>
> Trying to use _coalesce_ after shuffle-oriented transformations leads to 
> OutOfMemoryErrors or _Container killed by YARN for exceeding memory limits. X 
> GB of Y GB physical memory used. Consider 
> boosting spark.yarn.executor.memoryOverhead_.
> Discussion is 
> [here|http://apache-spark-developers-list.1001551.n3.nabble.com/Coalesce-behaviour-td25289.html].
> The error happens when trying to specify a pretty small number of partitions 
> in the _coalesce_ call.
> *How to reproduce?*
> # Start spark-shell
> {code:bash}
> spark-shell \ 
>   --num-executors=5 \ 
>   --executor-cores=2 \ 
>   --master=yarn \
>   --deploy-mode=client \ 
>   --conf spark.executor.memory=1g \ 
>   --conf spark.dynamicAllocation.enabled=false \
>   --conf spark.executor.extraJavaOptions='-XX:+HeapDumpOnOutOfMemoryError 
> -XX:HeapDumpPath=/tmp -Dio.netty.noUnsafe=true'
> {code}
> Please note the _-Dio.netty.noUnsafe=true_ property. Preventing off-heap 
> memory usage seems to be the only way to control the amount of memory used 
> for shuffle data transfer at the moment.
> Also note that the total number of cores allocated for the job is 5x2=10.
> # Then generate some test data
> {code:scala}
> import org.apache.hadoop.io._ 
> import org.apache.hadoop.io.compress._ 
> import org.apache.commons.lang._ 
> import org.apache.spark._ 
> // generate 100M records of sample data 
> sc.makeRDD(1 to 1000, 1000) 
>   .flatMap(item => (1 to 10) 
> .map(i => new Text(RandomStringUtils.randomAlphanumeric(3).toLowerCase) 
> -> new Text(RandomStringUtils.randomAlphanumeric(1024)))) 
>   .saveAsSequenceFile("/tmp/random-strings", Some(classOf[GzipCodec])) 
> {code}
> # Run the sample job
> {code:scala}
> import org.apache.hadoop.io._
> import org.apache.spark._
> import org.apache.spark.storage._
> val rdd = sc.sequenceFile("/tmp/random-strings", classOf[Text], classOf[Text])
> rdd 
>   .map(item => item._1.toString -> item._2.toString) 
>   .repartitionAndSortWithinPartitions(new HashPartitioner(1000)) 
>   .coalesce(10,false) 
>   .count 
> {code}
> Note that the number of partitions is equal to the total number of cores 
> allocated to the job.
> Here is dominator tree from the heapdump
>  !run1-noparams-dominator-tree.png|width=700!
> 4 instances of ExternalSorter, although there are only 2 concurrently running 
> tasks per executor.
>  !run1-noparams-dominator-tree-externalsorter.png|width=700! 
> And paths to GC root of the already stopped ExternalSorter.
>  !run1-noparams-dominator-tree-externalsorter-gc-root.png|width=700! 
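
As a point of comparison, here is an editor's sketch (not part of the report) of one 
commonly suggested variation: forcing a shuffle in _coalesce_ (equivalent to 
_repartition_) keeps the 1000-partition sort and the reduction to 10 partitions in 
separate stages, so a single task no longer reads and sorts several shuffle 
partitions at once. Whether the extra shuffle is acceptable depends on the job.

{code:scala}
import org.apache.spark.HashPartitioner

// Same pipeline as in the reproduction above, but with shuffle = true in coalesce;
// rdd refers to the sequence-file RDD built in the previous snippet.
rdd
  .map(item => item._1.toString -> item._2.toString)
  .repartitionAndSortWithinPartitions(new HashPartitioner(1000))
  .coalesce(10, shuffle = true)
  .count
{code}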



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


