[jira] [Assigned] (SPARK-27171) Support Full-Partition limit in the first scan

2019-03-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27171:


Assignee: Apache Spark

> Support Full-Partition limit in the first scan
> -
>
> Key: SPARK-27171
> URL: https://issues.apache.org/jira/browse/SPARK-27171
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: deshanxiao
>Assignee: Apache Spark
>Priority: Major
>
> SparkPlan#executeTake has to collect elements starting from a single partition.
> This can be slow for some queries, even though Spark is primarily geared toward
> batch workloads. It would be worthwhile to add a switch that lets users scan all
> partitions in the first pass of a limit.
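For context, here is a minimal sketch (in Scala, with hypothetical helper names) of the incremental strategy that an executeTake-style collection follows: scan one partition first, then widen each subsequent pass by a scale-up factor until enough rows are found. It only illustrates the behaviour the proposed switch would bypass; it is not Spark's actual implementation.

{code:java}
// Hypothetical sketch of an executeTake-style incremental scan.
// `scanPartitions(range)` stands in for running a job over that partition range.
def incrementalTake[T](limit: Int, numPartitions: Int, scaleUpFactor: Int)
                      (scanPartitions: Range => Seq[T]): Seq[T] = {
  val buf = scala.collection.mutable.ArrayBuffer.empty[T]
  var scanned = 0
  var batch = 1                        // the first pass touches a single partition
  while (buf.length < limit && scanned < numPartitions) {
    val upTo = math.min(scanned + batch, numPartitions)
    buf ++= scanPartitions(scanned until upTo).take(limit - buf.length)
    scanned = upTo
    batch *= scaleUpFactor             // widen the next pass
  }
  buf.toSeq
}
{code}

The switch proposed here would, in effect, make the first pass cover all partitions at once instead of starting with a single one.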






[jira] [Assigned] (SPARK-27171) Support Full-Partition limit in the first scan

2019-03-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27171:


Assignee: (was: Apache Spark)

> Support Full-Partition limit in the first scan
> -
>
> Key: SPARK-27171
> URL: https://issues.apache.org/jira/browse/SPARK-27171
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: deshanxiao
>Priority: Major
>
> SparkPlan#executeTake has to collect elements starting from a single partition.
> This can be slow for some queries, even though Spark is primarily geared toward
> batch workloads. It would be worthwhile to add a switch that lets users scan all
> partitions in the first pass of a limit.






[jira] [Updated] (SPARK-27172) CRLF Injection/HTTP response splitting on spark embedded jetty servlet.

2019-03-14 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27172:
-
Description: 
Can we upgrade the embedded Jetty servlet in Spark 1.6.2? Will any dependencies be 
affected if we upgrade it? The reason is that we would like to patch a vulnerability 
reported by our scan, namely CRLF injection attacks. Please refer to the information 
below.

Description:

This script is possibly vulnerable to CRLF injection attacks. HTTP headers have 
the structure "Key: Value", where each line is separated by the CRLF 
combination. If the user input is injected into the value section without 
properly escaping/removing CRLF characters it is possible to alter the HTTP 
headers structure. HTTP Response Splitting is a new application attack 
technique which enables various new attacks such as web cache poisoning, cross 
user defacement, hijacking pages with sensitive user information and cross-site 
scripting (XSS). The attacker sends a single HTTP request that forces the web 
server to form an output stream, which is then interpreted by the target as two 
HTTP responses instead of one response.

 CWE #;

CWE-113: Improper Neutralization of CRLF Sequences in HTTP Headers ('HTTP 
Response Splitting')

 

 

 

  was:
Can we upgrade the embedded Jetty servlet in Spark 1.6.2? According to our 
vulnerability scan, the embedded Jetty servlet is vulnerable to CRLF injection 
attacks. Please refer to the information below.

Description:

This script is possibly vulnerable to CRLF injection attacks. HTTP headers have 
the structure "Key: Value", where each line is separated by the CRLF 
combination. If the user input is injected into the value section without 
properly escaping/removing CRLF characters it is possible to alter the HTTP 
headers structure. HTTP Response Splitting is a new application attack 
technique which enables various new attacks such as web cache poisoning, cross 
user defacement, hijacking pages with sensitive user information and cross-site 
scripting (XSS). The attacker sends a single HTTP request that forces the web 
server to form an output stream, which is then interpreted by the target as two 
HTTP responses instead of one response.

 CWE #;

CWE-113: Improper Neutralization of CRLF Sequences in HTTP Headers ('HTTP 
Response Splitting')

 

 

 


> CRLF Injection/HTTP response splitting on spark embedded jetty servlet.
> ---
>
> Key: SPARK-27172
> URL: https://issues.apache.org/jira/browse/SPARK-27172
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Web UI
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Major
>
> Can we upgrade the embedded Jetty servlet in Spark 1.6.2? Will any dependencies 
> be affected if we upgrade it? The reason is that we would like to patch a 
> vulnerability reported by our scan, namely CRLF injection attacks. Please refer 
> to the information below.
> Description:
> This script is possibly vulnerable to CRLF injection attacks. HTTP headers 
> have the structure "Key: Value", where each line is separated by the CRLF 
> combination. If the user input is injected into the value section without 
> properly escaping/removing CRLF characters it is possible to alter the HTTP 
> headers structure. HTTP Response Splitting is a new application attack 
> technique which enables various new attacks such as web cache poisoning, 
> cross user defacement, hijacking pages with sensitive user information and 
> cross-site scripting (XSS). The attacker sends a single HTTP request that 
> forces the web server to form an output stream, which is then interpreted by 
> the target as two HTTP responses instead of one response.
>  CWE #;
> CWE-113: Improper Neutralization of CRLF Sequences in HTTP Headers ('HTTP 
> Response Splitting')
>  
>  
>  
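Independently of the Jetty upgrade, the usual mitigation for response splitting is to never copy user-controlled input into a response header without stripping CR/LF first. Below is a minimal, hypothetical Scala servlet illustrating that; the servlet class and header name are made up for illustration and are not part of Spark.

{code:java}
import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse}

// Hypothetical servlet: sanitize a request parameter before echoing it into a
// response header, so an attacker cannot inject "\r\n" and split the response.
class PageHeaderServlet extends HttpServlet {
  private def sanitizeHeaderValue(v: String): String =
    v.replaceAll("[\\r\\n]", "")   // drop CR and LF so the header cannot be split

  override def doGet(req: HttpServletRequest, resp: HttpServletResponse): Unit = {
    val page = Option(req.getParameter("page")).getOrElse("")
    resp.setHeader("X-Requested-Page", sanitizeHeaderValue(page))
    resp.setStatus(HttpServletResponse.SC_OK)
  }
}
{code}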






[jira] [Updated] (SPARK-27167) What would be the possible impact if I upgrade /static/jquery-1.11.1.min.js ?

2019-03-14 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27167:
-
Description: 
Will there be a big impact on the system if the current 
/static/jquery-1.11.1.min.js is updated to the latest version?

According to our vulnerability scan, the JavaScript library we are currently using 
is vulnerable, and we want to address this. We would appreciate any help from the 
community.

*Description:*
 You are using a vulnerable Javascript library. One or more vulnerabilities 
were reported for this version of the Javascript library. Consult Attack 
details and Web References for more information about the affected library and 
the vulnerabilities that were reported.

*CWE #:*
 CWE-16 - Category - configuration
  
  Thank you,

 

  was:
Will there be a big impact on my system if my current 
/static/jquery-1.11.1.min.js is updated to the latest version?

According to our vulnerability scan, the JavaScript library we are currently using 
is vulnerable, and we want to address this. We would appreciate any help from the 
community.

*Description:*
You are using a vulnerable Javascript library. One or more vulnerabilities were 
reported for this version of the Javascript library. Consult Attack details and 
Web References for more information about the affected library and the 
vulnerabilities that were reported.

*CWE #:*
CWE-16 - Category - configuration
  
  

Thank you,

 


> What would be the possible impact if I upgrade /static/jquery-1.11.1.min.js ?
> -
>
> Key: SPARK-27167
> URL: https://issues.apache.org/jira/browse/SPARK-27167
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Minor
>
> Will there be a big impact on the system if the current 
> /static/jquery-1.11.1.min.js is updated to the latest version?
> According to our vulnerability scan, the JavaScript library we are currently 
> using is vulnerable, and we want to address this. We would appreciate any help 
> from the community.
> *Description:*
>  You are using a vulnerable Javascript library. One or more vulnerabilities 
> were reported for this version of the Javascript library. Consult Attack 
> details and Web References for more information about the affected library 
> and the vulnerabilities that were reported.
> *CWE #:*
>  CWE-16 - Category - configuration
>   
>   Thank you,
>  






[jira] [Updated] (SPARK-27156) Why is the "http://:18080/static" browsable?

2019-03-14 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27156:
-
Issue Type: Bug  (was: Question)

> Why is the "http://:18080/static" browsable?
> 
>
> Key: SPARK-27156
> URL: https://issues.apache.org/jira/browse/SPARK-27156
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Minor
> Attachments: Screen Shot 2019-03-14 at 11.46.31 AM.png
>
>
> I would like to know whether there is a way to disable the Spark history server's 
> /static folder; please refer to the attachment provided. I am asking for security 
> reasons.






[jira] [Updated] (SPARK-27172) CRLF Injection/HTTP response splitting on spark embedded jetty servlet.

2019-03-14 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27172:
-
Issue Type: Dependency upgrade  (was: Question)

> CRLF Injection/HTTP response splitting on spark embedded jetty servlet.
> ---
>
> Key: SPARK-27172
> URL: https://issues.apache.org/jira/browse/SPARK-27172
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Web UI
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Major
>
> Can we upgrade the embedded Jetty servlet in Spark 1.6.2? According to our 
> vulnerability scan, the embedded Jetty servlet is vulnerable to CRLF injection 
> attacks. Please refer to the information below.
> Description:
> This script is possibly vulnerable to CRLF injection attacks. HTTP headers 
> have the structure "Key: Value", where each line is separated by the CRLF 
> combination. If the user input is injected into the value section without 
> properly escaping/removing CRLF characters it is possible to alter the HTTP 
> headers structure. HTTP Response Splitting is a new application attack 
> technique which enables various new attacks such as web cache poisoning, 
> cross user defacement, hijacking pages with sensitive user information and 
> cross-site scripting (XSS). The attacker sends a single HTTP request that 
> forces the web server to form an output stream, which is then interpreted by 
> the target as two HTTP responses instead of one response.
>  CWE #;
> CWE-113: Improper Neutralization of CRLF Sequences in HTTP Headers ('HTTP 
> Response Splitting')
>  
>  
>  






[jira] [Updated] (SPARK-27172) CRLF Injection/HTTP response splitting on spark embedded jetty servlet.

2019-03-14 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27172:
-
Description: 
Can we upgrade the embedded Jetty servlet in Spark 1.6.2? According to our 
vulnerability scan, the embedded Jetty servlet is vulnerable to CRLF injection 
attacks. Please refer to the information below.

Description:

This script is possibly vulnerable to CRLF injection attacks. HTTP headers have 
the structure "Key: Value", where each line is separated by the CRLF 
combination. If the user input is injected into the value section without 
properly escaping/removing CRLF characters it is possible to alter the HTTP 
headers structure. HTTP Response Splitting is a new application attack 
technique which enables various new attacks such as web cache poisoning, cross 
user defacement, hijacking pages with sensitive user information and cross-site 
scripting (XSS). The attacker sends a single HTTP request that forces the web 
server to form an output stream, which is then interpreted by the target as two 
HTTP responses instead of one response.

 CWE #;

CWE-113: Improper Neutralization of CRLF Sequences in HTTP Headers ('HTTP 
Response Splitting')

 

 

 

  was:
Can we upgrade the embedded Jetty servlet in Spark 1.6.2? Is this possible, and 
will there be any impact if we do upgrade it?

Please refer to the description of the vulnerability below:

Description:

This script is possibly vulnerable to CRLF injection attacks. HTTP headers have 
the structure "Key: Value", where each line is separated by the CRLF 
combination. If the user input is injected into the value section without 
properly escaping/removing CRLF characters it is possible to alter the HTTP 
headers structure. HTTP Response Splitting is a new application attack 
technique which enables various new attacks such as web cache poisoning, cross 
user defacement, hijacking pages with sensitive user information and cross-site 
scripting (XSS). The attacker sends a single HTTP request that forces the web 
server to form an output stream, which is then interpreted by the target as two 
HTTP responses instead of one response.

 

CWE #;

CWE-113: Improper Neutralization of CRLF Sequences in HTTP Headers ('HTTP 
Response Splitting')

 

 


  


> CRLF Injection/HTTP response splitting on spark embedded jetty servlet.
> ---
>
> Key: SPARK-27172
> URL: https://issues.apache.org/jira/browse/SPARK-27172
> Project: Spark
>  Issue Type: Question
>  Components: Web UI
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Major
>
> Can we upgrade the embedded Jetty servlet in Spark 1.6.2? According to our 
> vulnerability scan, the embedded Jetty servlet is vulnerable to CRLF injection 
> attacks. Please refer to the information below.
> Description:
> This script is possibly vulnerable to CRLF injection attacks. HTTP headers 
> have the structure "Key: Value", where each line is separated by the CRLF 
> combination. If the user input is injected into the value section without 
> properly escaping/removing CRLF characters it is possible to alter the HTTP 
> headers structure. HTTP Response Splitting is a new application attack 
> technique which enables various new attacks such as web cache poisoning, 
> cross user defacement, hijacking pages with sensitive user information and 
> cross-site scripting (XSS). The attacker sends a single HTTP request that 
> forces the web server to form an output stream, which is then interpreted by 
> the target as two HTTP responses instead of one response.
>  CWE #;
> CWE-113: Improper Neutralization of CRLF Sequences in HTTP Headers ('HTTP 
> Response Splitting')
>  
>  
>  






[jira] [Updated] (SPARK-27167) What would be the possible impact if I upgrade /static/jquery-1.11.1.min.js ?

2019-03-14 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27167:
-
Issue Type: Dependency upgrade  (was: Question)

> What would be the possible impact if I upgrade /static/jquery-1.11.1.min.js ?
> -
>
> Key: SPARK-27167
> URL: https://issues.apache.org/jira/browse/SPARK-27167
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Minor
>
> Will there be a big impact on my system if my current 
> /static/jquery-1.11.1.min.js is updated to the latest version?
> According to our vulnerability scan, the JavaScript library we are currently 
> using is vulnerable, and we want to address this. We would appreciate any help 
> from the community.
> *Description:*
> You are using a vulnerable Javascript library. One or more vulnerabilities 
> were reported for this version of the Javascript library. Consult Attack 
> details and Web References for more information about the affected library 
> and the vulnerabilities that were reported.
> *CWE #:*
> CWE-16 - Category - configuration
>   
>   
> Thank you,
>  






[jira] [Updated] (SPARK-27167) What would be the possible impact if I upgrade /static/jquery-1.11.1.min.js ?

2019-03-14 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27167:
-
Description: 
Will there be a big impact on my system if my current 
/static/jquery-1.11.1.min.js is updated to the latest version?

According to our vulnerability scan, the JavaScript library we are currently using 
is vulnerable, and we want to address this. We would appreciate any help from the 
community.

*Description:*
You are using a vulnerable Javascript library. One or more vulnerabilities were 
reported for this version of the Javascript library. Consult Attack details and 
Web References for more information about the affected library and the 
vulnerabilities that were reported.

*CWE #:*
CWE-16 - Category - configuration
  
  

Thank you,

 

  was:
Will there be a big impact on my system if my current 
/static/jquery-1.11.1.min.js is updated to the latest version?

According to our vulnerability scan, the JavaScript library we are currently using 
is vulnerable, and we want to address this. We would appreciate any help from the 
community.

Please refer to the attachment provided.
  
  

Thank you,

 


> What would be the possible impact if I upgrade /static/jquery-1.11.1.min.js ?
> -
>
> Key: SPARK-27167
> URL: https://issues.apache.org/jira/browse/SPARK-27167
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Minor
>
> Will there be a big impact on my system if my current 
> /static/jquery-1.11.1.min.js is updated to the latest version?
> According to our vulnerability scan, the JavaScript library we are currently 
> using is vulnerable, and we want to address this. We would appreciate any help 
> from the community.
> *Description:*
> You are using a vulnerable Javascript library. One or more vulnerabilities 
> were reported for this version of the Javascript library. Consult Attack 
> details and Web References for more information about the affected library 
> and the vulnerabilities that were reported.
> *CWE #:*
> CWE-16 - Category - configuration
>   
>   
> Thank you,
>  






[jira] [Updated] (SPARK-27167) What would be the possible impact if I upgrade /static/jquery-1.11.1.min.js ?

2019-03-14 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27167:
-
Attachment: (was: Vulnerability Javascript library.xlsx)

> What would be the possible impact if I upgrade /static/jquery-1.11.1.min.js ?
> -
>
> Key: SPARK-27167
> URL: https://issues.apache.org/jira/browse/SPARK-27167
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Minor
>
> Will there be a big impact on my system if my current 
> /static/jquery-1.11.1.min.js is updated to the latest version?
> According to our vulnerability scan, the JavaScript library we are currently 
> using is vulnerable, and we want to address this. We would appreciate any help 
> from the community.
> Please refer to the attachment provided.
>   
>   
> Thank you,
>  






[jira] [Updated] (SPARK-27172) CRLF Injection/HTTP response splitting on spark embedded jetty servlet.

2019-03-14 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27172:
-
Description: 
Can we upgrade the embedded Jetty servlet in Spark 1.6.2? Is this possible, and 
will there be any impact if we do upgrade it?

Please refer to the description of the vulnerability below:

Description:

This script is possibly vulnerable to CRLF injection attacks. HTTP headers have 
the structure "Key: Value", where each line is separated by the CRLF 
combination. If the user input is injected into the value section without 
properly escaping/removing CRLF characters it is possible to alter the HTTP 
headers structure. HTTP Response Splitting is a new application attack 
technique which enables various new attacks such as web cache poisoning, cross 
user defacement, hijacking pages with sensitive user information and cross-site 
scripting (XSS). The attacker sends a single HTTP request that forces the web 
server to form an output stream, which is then interpreted by the target as two 
HTTP responses instead of one response.

 

CWE #;

CWE-113: Improper Neutralization of CRLF Sequences in HTTP Headers ('HTTP 
Response Splitting')

 

 


  

  was:
Can we upgrade the embedded Jetty servlet in Spark 1.6.2? Is this possible, and 
will there be any impact if we do upgrade it?

Please refer to the provided attachment for more information.
 


> CRLF Injection/HTTP response splitting on spark embedded jetty servlet.
> ---
>
> Key: SPARK-27172
> URL: https://issues.apache.org/jira/browse/SPARK-27172
> Project: Spark
>  Issue Type: Question
>  Components: Web UI
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Major
>
> Can we upgrade the embedded Jetty servlet in Spark 1.6.2? Is this possible, and 
> will there be any impact if we do upgrade it?
> Please refer to the description of the vulnerability below:
> Description:
> This script is possibly vulnerable to CRLF injection attacks. HTTP headers 
> have the structure "Key: Value", where each line is separated by the CRLF 
> combination. If the user input is injected into the value section without 
> properly escaping/removing CRLF characters it is possible to alter the HTTP 
> headers structure. HTTP Response Splitting is a new application attack 
> technique which enables various new attacks such as web cache poisoning, 
> cross user defacement, hijacking pages with sensitive user information and 
> cross-site scripting (XSS). The attacker sends a single HTTP request that 
> forces the web server to form an output stream, which is then interpreted by 
> the target as two HTTP responses instead of one response.
>  
> CWE #;
> CWE-113: Improper Neutralization of CRLF Sequences in HTTP Headers ('HTTP 
> Response Splitting')
>  
>  
>   






[jira] [Updated] (SPARK-27172) CRLF Injection/HTTP response splitting on spark embedded jetty servlet.

2019-03-14 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27172:
-
Attachment: (was: CRLF injection - Sheet1.pdf)

> CRLF Injection/HTTP response splitting on spark embedded jetty servlet.
> ---
>
> Key: SPARK-27172
> URL: https://issues.apache.org/jira/browse/SPARK-27172
> Project: Spark
>  Issue Type: Question
>  Components: Web UI
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Major
>
> Can we upgrade the embedded Jetty servlet in Spark 1.6.2? Is this possible, and 
> will there be any impact if we do upgrade it?
> Please refer to the provided attachment for more information.
>  






[jira] [Updated] (SPARK-27167) What would be the possible impact if I upgrade /static/jquery-1.11.1.min.js ?

2019-03-14 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27167:
-
Attachment: Vulnerability Javascript library.xlsx

> What would be the possible impact if I upgrade /static/jquery-1.11.1.min.js ?
> -
>
> Key: SPARK-27167
> URL: https://issues.apache.org/jira/browse/SPARK-27167
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Minor
> Attachments: Vulnerability Javascript library.xlsx
>
>
> Will there be a big impact on my system if my current 
> /static/jquery-1.11.1.min.js is updated to the latest version?
> According to a vulnerability assessment (VA) scan, the JavaScript library we are 
> currently using is vulnerable, and we want to address this. We would appreciate 
> any help from the community.
> Please refer to the attachment provided.
>  
>  
> Thank you,
>  






[jira] [Updated] (SPARK-27167) What would be the possible impact if I upgrade /static/jquery-1.11.1.min.js ?

2019-03-14 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27167:
-
Description: 
Will there be a big impact on my system if my current 
/static/jquery-1.11.1.min.js is updated to the latest version?

According to our vulnerability scan, the JavaScript library we are currently using 
is vulnerable, and we want to address this. We would appreciate any help from the 
community.

Please refer to the attachment provided.
  
  

Thank you,

 

  was:
Will there be a big impact on my system if my current 
/static/jquery-1.11.1.min.js is updated to the latest version?

According to a vulnerability assessment (VA) scan, the JavaScript library we are 
currently using is vulnerable, and we want to address this. We would appreciate 
any help from the community.

Please refer to the attachment provided.
 
 

Thank you,

 


> What would be the possible impact if I upgrade /static/jquery-1.11.1.min.js ?
> -
>
> Key: SPARK-27167
> URL: https://issues.apache.org/jira/browse/SPARK-27167
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Minor
> Attachments: Vulnerability Javascript library.xlsx
>
>
> Will there be a big impact on my system if my current 
> /static/jquery-1.11.1.min.js is updated to the latest version?
> According to our vulnerability scan, the JavaScript library we are currently 
> using is vulnerable, and we want to address this. We would appreciate any help 
> from the community.
> Please refer to the attachment provided.
>   
>   
> Thank you,
>  






[jira] [Updated] (SPARK-27167) What would be the possible impact if I upgrade /static/jquery-1.11.1.min.js ?

2019-03-14 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27167:
-
Description: 
Will there be a big impact on my system if my current 
/static/jquery-1.11.1.min.js is updated to the latest version?

According to a vulnerability assessment (VA) scan, the JavaScript library we are 
currently using is vulnerable, and we want to address this. We would appreciate 
any help from the community.

Please refer to the attachment provided.
 
 

Thank you,

 

  was:
Will there be a big impact on my system if my current 
/static/jquery-1.11.1.min.js is updated to the latest version?

According to a vulnerability assessment (VA) scan, the JavaScript library we are 
currently using is vulnerable, and we want to address this. We would appreciate 
any help from the community.

Please refer below for more information:
|CVS|Severity|Description|Impact|Recommendation|Affected|Reference:|
|Vulnerable Javascript library|Medium|You are using a vulnerable Javascript 
library. One or more vulnerabilities were reported for this version of the 
Javascript library. Consult Attack details and Web References for more 
information about the affected library and the vulnerabilities that were 
reported.|Consult References for more information.|Upgrade to the latest 
version.|/static/jquery-1.11.1.min.js
  
 Details
 Detected Javascript library jquery version 1.11.1. The version was detected 
from filename.|References:
 [https://github.com/jquery/jquery/issues/2432]
 [http://blog.jquery.com/2016/01/08/jquery-2-2-and-1-12-released/]
  
 [https://snyk.io/test/npm/jquery/1.11.1]
  
 related reference not directly with spark:
 
[https://community.hortonworks.com/questions/89874/ambari-jquery-172-upgrade-to-jquery191.html]|

 

Thank you,

 


> What would be the possible impact if I upgrade /static/jquery-1.11.1.min.js ?
> -
>
> Key: SPARK-27167
> URL: https://issues.apache.org/jira/browse/SPARK-27167
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Minor
>
> Will there be a big impact on my system if my current 
> /static/jquery-1.11.1.min.js is updated to the latest version?
> According to a vulnerability assessment (VA) scan, the JavaScript library we are 
> currently using is vulnerable, and we want to address this. We would appreciate 
> any help from the community.
> Please refer to the attachment provided.
>  
>  
> Thank you,
>  






[jira] [Commented] (SPARK-22506) Spark thrift server can not impersonate user in kerberos

2019-03-14 Thread Wataru Yukawa (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793308#comment-16793308
 ] 

Wataru Yukawa commented on SPARK-22506:
---

Hi,

The Spark thrift server can impersonate a user in our kerberized Hadoop cluster 
with Spark 2.1.1 (HDP-2.6.2.0) and the following setting when I execute a select 
query:
{code:java}
hive.server2.enable.doAs=true
{code}
But it cannot impersonate the user for create queries.
For example, if you execute the following query, 
/apps/hive/warehouse/hoge.db/piyo in HDFS ends up owned by the hive user.
{code:java}
create table hoge.piyo(str string)
{code}
Thanks
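As a hedged illustration of how to observe this (not part of the original report): after issuing the CREATE TABLE through the thrift server, the owner of the table directory can be checked with the Hadoop FileSystem API. The warehouse path and object name below are assumptions taken from the example above.

{code:java}
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

// Hypothetical check: inspect who owns the table directory after the CREATE TABLE.
// If impersonation (doAs) worked, the owner should be the end user, not "hive".
object CheckTableOwner {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    val tableDir = new Path("/apps/hive/warehouse/hoge.db/piyo")
    val fs = tableDir.getFileSystem(spark.sparkContext.hadoopConfiguration)
    println(s"owner = ${fs.getFileStatus(tableDir).getOwner}")
    spark.stop()
  }
}
{code}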

> Spark thrift server can not impersonate user in kerberos 
> -
>
> Key: SPARK-22506
> URL: https://issues.apache.org/jira/browse/SPARK-22506
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.2.0
>Reporter: sydt
>Priority: Major
> Attachments: screenshot-1.png
>
>
> The Spark thrift server cannot impersonate a user in a kerberized environment.
> I launch the Spark thrift server in *yarn-client* mode as user *hive*, which is 
> allowed to impersonate other users.
> User *jt_jzyx_project7* submits a SQL statement to query its own table located 
> in the HDFS directory /user/jt_jzyx_project7, and the following error occurs:
> Permission denied: *user=hive*, access=EXECUTE, 
> inode=*"/user/jt_jzyx_project7*":hdfs:jt_jzyx_project7:drwxrwx---:user:g_dcpt_project1:rwx,group::rwx
> Obviously, the Spark thrift server did not proxy user jt_jzyx_project7 in HDFS.
> This happened at the task stage, which means it passed the Hive authorization check.
> !screenshot-1.png!






[jira] [Created] (SPARK-27172) CRLF Injection/HTTP response splitting on spark embedded jetty servlet.

2019-03-14 Thread Jerry Garcia (JIRA)
Jerry Garcia created SPARK-27172:


 Summary: CRLF Injection/HTTP response splitting on spark embedded 
jetty servlet.
 Key: SPARK-27172
 URL: https://issues.apache.org/jira/browse/SPARK-27172
 Project: Spark
  Issue Type: Question
  Components: Web UI
Affects Versions: 1.6.2
Reporter: Jerry Garcia


Can we upgrade the embedded Jetty servlet in Spark 1.6.2? Is this possible, and 
will there be any impact if we do upgrade it?

Please refer to the provided attachment for more information.
|CVS|Severity|Description|Impact|Recommendation|Affected|Reference:|
|CRLF injection/HTTP response splitting|Medium|This script is possibly 
vulnerable to CRLF injection attacks.
HTTP headers have the structure "Key: Value", where each line is separated by 
the CRLF combination. If the user input is injected into the value section 
without properly escaping/removing CRLF characters it is possible to alter the 
HTTP headers structure.
HTTP Response Splitting is a new application attack technique which enables 
various new attacks such as web cache poisoning, cross user defacement, 
hijacking pages with sensitive user information and cross-site scripting (XSS). 
The attacker sends a single HTTP request that forces the web server to form an 
output stream, which is then interpreted by the target as two HTTP responses 
instead of one response.|It is possible for a remote attacker to inject custom 
HTTP headers. For example, an attacker can inject session cookies or HTML code. 
This may lead to vulnerabilities like XSS (cross-site scripting) or session 
fixation.|You need to strip CR (0x0D) and LF (0x0A) from the user input or 
properly encode the output in order to prevent the injection of custom HTTP 
headers.|Web Server
Details
URL encoded GET input page was set to 
%c4%8d%c4%8aSomeCustomInjectedHeader:%20injected_by_wvs
Injected header found:
SomeCustomInjectedHeader: injected_by_wvs
Request headers
GET 
/?page=%c4%8d%c4%8aSomeCustomInjectedHeader:%20injected_by_wvs&showIncomplete=false
 HTTP/1.1
Referer: https://app30.goldmine.bdo.com.ph
Host: app30.goldmine.bdo.com.ph
Connection: Keep-alive
Accept-Encoding: gzip,deflate
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.21 (KHTML, like 
Gecko)
Chrome/41.0.2228.0 Safari/537.21
Acunetix-Product: WVS/11.0 (Acunetix - WVSE)
Acunetix-Scanning-agreement: Third Party Scanning PROHIBITED
Acunetix-User-agreement: http://www.acunetix.com/wvs/disc.htm
Accept: */*
 
 
Web Server
Details
URL encoded GET input showIncomplete was set to 
%c4%8d%c4%8aSomeCustomInjectedHeader:%20injected_by_wvs
 
Injected header found:
SomeCustomInjectedHeader: injected_by_wvs
Request headers
GET 
/?page=3&showIncomplete=%c4%8d%c4%8aSomeCustomInjectedHeader:%20injected_by_wvs 
HTTP/1.1
Referer: https://app30.goldmine.bdo.com.ph
Host: app30.goldmine.bdo.com.ph
Connection: Keep-alive
Accept-Encoding: gzip,deflate
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.21 (KHTML, like 
Gecko)
Chrome/41.0.2228.0 Safari/537.21
Acunetix-Product: WVS/11.0 (Acunetix - WVSE)
Acunetix-Scanning-agreement: Third Party Scanning PROHIBITED
Acunetix-User-agreement: http://www.acunetix.com/wvs/disc.htm
Accept: */*|Acunetix CRLF Injection Attack 
(http://www.acunetix.com/websitesecurity/crlf-injection.htm)
 
Whitepaper - HTTP Response Splitting 
(http://packetstormsecurity.org/papers/general/whitepaper_httpresponse.pdf)
 
Introduction to HTTP Response Splitting 
(http://www.securiteam.com/securityreviews/5WP0E2KFGK.html)
 
https://www.cvedetails.com/cve/CVE-2007-5615/
 
https://cwe.mitre.org/data/definitions/113.html|






[jira] [Updated] (SPARK-27172) CRLF Injection/HTTP response splitting on spark embedded jetty servlet.

2019-03-14 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27172:
-
Attachment: CRLF injection - Sheet1.pdf

> CRLF Injection/HTTP response splitting on spark embedded jetty servlet.
> ---
>
> Key: SPARK-27172
> URL: https://issues.apache.org/jira/browse/SPARK-27172
> Project: Spark
>  Issue Type: Question
>  Components: Web UI
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Major
> Attachments: CRLF injection - Sheet1.pdf
>
>
> Can we upgrade the embedded Jetty servlet in Spark 1.6.2? Is this possible, and 
> will there be any impact if we do upgrade it?
> Please refer to the provided attachment for more information.
>  






[jira] [Resolved] (SPARK-27132) Improve file source V2 framework

2019-03-14 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27132.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24066
[https://github.com/apache/spark/pull/24066]

> Improve file source V2 framework
> 
>
> Key: SPARK-27132
> URL: https://issues.apache.org/jira/browse/SPARK-27132
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> During the migration of CSV to the V2 framework, I found that we can improve the 
> file source V2 framework by:
> 1. Checking for duplicated column names in both read and write (a sketch of such 
> a check follows below).
> 2. Removing `SupportsPushDownFilters` from FileScanBuilder, since not all file 
> sources support filter push-down.
> 3. Adding a new member `options` to FileScan, because the method `isSplitable` 
> might require data source options.
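A minimal sketch of the kind of duplicate-column check item 1 describes, written as a standalone Scala helper rather than Spark's actual implementation (the case-sensitivity flag is an assumption):

{code:java}
// Hypothetical helper: fail fast if a schema contains duplicated column names.
def checkDuplicateColumnNames(columnNames: Seq[String], caseSensitive: Boolean): Unit = {
  val normalized = if (caseSensitive) columnNames else columnNames.map(_.toLowerCase)
  val duplicates = normalized.groupBy(identity).collect { case (name, ns) if ns.size > 1 => name }
  if (duplicates.nonEmpty) {
    throw new IllegalArgumentException(s"Found duplicate column(s): ${duplicates.mkString(", ")}")
  }
}
{code}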






[jira] [Assigned] (SPARK-27132) Improve file source V2 framework

2019-03-14 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27132:
---

Assignee: Gengliang Wang

> Improve file source V2 framework
> 
>
> Key: SPARK-27132
> URL: https://issues.apache.org/jira/browse/SPARK-27132
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> During the migration of CSV to the V2 framework, I found that we can improve the 
> file source V2 framework by:
> 1. Checking for duplicated column names in both read and write.
> 2. Removing `SupportsPushDownFilters` from FileScanBuilder, since not all file 
> sources support filter push-down.
> 3. Adding a new member `options` to FileScan, because the method `isSplitable` 
> might require data source options.






[jira] [Assigned] (SPARK-27136) Remove data source option check_files_exist

2019-03-14 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27136:
---

Assignee: Gengliang Wang

> Remove data source option check_files_exist
> ---
>
> Key: SPARK-27136
> URL: https://issues.apache.org/jira/browse/SPARK-27136
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> The data source option check_files_exist was introduced in 
> https://github.com/apache/spark/pull/23383 when the file source V2 framework 
> was implemented. In that PR, FileIndex was created as a member of FileTable, so 
> that we could implement partition pruning like 0f9fcab in the future. At that 
> time, FileIndexes were always created for file writes, so we needed the 
> option to decide whether to check file existence.
> After https://github.com/apache/spark/pull/23774, the option is not needed 
> anymore. This PR removes the option.
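For readers unfamiliar with what the option guarded, here is a hedged sketch of what such a file-existence check amounts to, using plain Hadoop FileSystem calls (the helper name is illustrative, not Spark's internal code):

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical guard: verify that every input path exists before building a scan.
def checkFilesExist(paths: Seq[String], hadoopConf: Configuration): Unit = {
  paths.foreach { p =>
    val path = new Path(p)
    val fs = path.getFileSystem(hadoopConf)
    if (!fs.exists(path)) {
      throw new java.io.FileNotFoundException(s"Path does not exist: $p")
    }
  }
}
{code}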






[jira] [Updated] (SPARK-27172) CRLF Injection/HTTP response splitting on spark embedded jetty servlet.

2019-03-14 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27172:
-
Description: 
Can we upgrade the embedded Jetty servlet in Spark 1.6.2? Is this possible, and 
will there be any impact if we do upgrade it?

Please refer to the provided attachment for more information.
 

  was:
Can we upgrade the embedded Jetty servlet in Spark 1.6.2? Is this possible, and 
will there be any impact if we do upgrade it?

Please refer to the provided attachment for more information.
|CVS|Severity|Description|Impact|Recommendation|Affected|Reference:|
|CRLF injection/HTTP response splitting|Medium|This script is possibly 
vulnerable to CRLF injection attacks.
HTTP headers have the structure "Key: Value", where each line is separated by 
the CRLF combination. If the user input is injected into the value section 
without properly escaping/removing CRLF characters it is possible to alter the 
HTTP headers structure.
HTTP Response Splitting is a new application attack technique which enables 
various new attacks such as web cache poisoning, cross user defacement, 
hijacking pages with sensitive user information and cross-site scripting (XSS). 
The attacker sends a single HTTP request that forces the web server to form an 
output stream, which is then interpreted by the target as two HTTP responses 
instead of one response.|It is possible for a remote attacker to inject custom 
HTTP headers. For example, an attacker can inject session cookies or HTML code. 
This may lead to vulnerabilities like XSS (cross-site scripting) or session 
fixation.|You need to strip CR (0x0D) and LF (0x0A) from the user input or 
properly encode the output in order to prevent the injection of custom HTTP 
headers.|Web Server
Details
URL encoded GET input page was set to 
%c4%8d%c4%8aSomeCustomInjectedHeader:%20injected_by_wvs
Injected header found:
SomeCustomInjectedHeader: injected_by_wvs
Request headers
GET 
/?page=%c4%8d%c4%8aSomeCustomInjectedHeader:%20injected_by_wvs&showIncomplete=false
 HTTP/1.1
Referer: https://app30.goldmine.bdo.com.ph
Host: app30.goldmine.bdo.com.ph
Connection: Keep-alive
Accept-Encoding: gzip,deflate
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.21 (KHTML, like 
Gecko)
Chrome/41.0.2228.0 Safari/537.21
Acunetix-Product: WVS/11.0 (Acunetix - WVSE)
Acunetix-Scanning-agreement: Third Party Scanning PROHIBITED
Acunetix-User-agreement: http://www.acunetix.com/wvs/disc.htm
Accept: */*
 
 
Web Server
Details
URL encoded GET input showIncomplete was set to 
%c4%8d%c4%8aSomeCustomInjectedHeader:%20injected_by_wvs
 
Injected header found:
SomeCustomInjectedHeader: injected_by_wvs
Request headers
GET 
/?page=3&showIncomplete=%c4%8d%c4%8aSomeCustomInjectedHeader:%20injected_by_wvs 
HTTP/1.1
Referer: https://app30.goldmine.bdo.com.ph
Host: app30.goldmine.bdo.com.ph
Connection: Keep-alive
Accept-Encoding: gzip,deflate
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.21 (KHTML, like 
Gecko)
Chrome/41.0.2228.0 Safari/537.21
Acunetix-Product: WVS/11.0 (Acunetix - WVSE)
Acunetix-Scanning-agreement: Third Party Scanning PROHIBITED
Acunetix-User-agreement: http://www.acunetix.com/wvs/disc.htm
Accept: */*|Acunetix CRLF Injection Attack 
(http://www.acunetix.com/websitesecurity/crlf-injection.htm)
 
Whitepaper - HTTP Response Splitting 
(http://packetstormsecurity.org/papers/general/whitepaper_httpresponse.pdf)
 
Introduction to HTTP Response Splitting 
(http://www.securiteam.com/securityreviews/5WP0E2KFGK.html)
 
https://www.cvedetails.com/cve/CVE-2007-5615/
 
https://cwe.mitre.org/data/definitions/113.html|


> CRLF Injection/HTTP response splitting on spark embedded jetty servlet.
> ---
>
> Key: SPARK-27172
> URL: https://issues.apache.org/jira/browse/SPARK-27172
> Project: Spark
>  Issue Type: Question
>  Components: Web UI
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Major
>
> Can we upgrade the embedded Jetty servlet in Spark 1.6.2? Is this possible, and 
> will there be any impact if we do upgrade it?
> Please refer to the provided attachment for more information.
>  






[jira] [Resolved] (SPARK-27166) Improve `printSchema` to print up to the given level

2019-03-14 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27166.
---
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/24098

> Improve `printSchema` to print up to the given level
> 
>
> Key: SPARK-27166
> URL: https://issues.apache.org/jira/browse/SPARK-27166
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>
> This issue aims to improve `printSchema` to be able to print up to the given 
> level of the schema.
> {code:java}
> scala> val df = Seq((1,(2,(3,4)))).toDF
> df: org.apache.spark.sql.DataFrame = [_1: int, _2: struct<_1: int, _2: 
> struct<_1: int, _2: int>>]
> scala> df.printSchema
> root
> |-- _1: integer (nullable = false)
> |-- _2: struct (nullable = true)
> | |-- _1: integer (nullable = false)
> | |-- _2: struct (nullable = true)
> | | |-- _1: integer (nullable = false)
> | | |-- _2: integer (nullable = false)
> scala> df.printSchema(1)
> root
> |-- _1: integer (nullable = false)
> |-- _2: struct (nullable = true)
> scala> df.printSchema(2)
> root
> |-- _1: integer (nullable = false)
> |-- _2: struct (nullable = true)
> | |-- _1: integer (nullable = false)
> | |-- _2: struct (nullable = true)
> scala> df.printSchema(3)
> root
> |-- _1: integer (nullable = false)
> |-- _2: struct (nullable = true)
> | |-- _1: integer (nullable = false)
> | |-- _2: struct (nullable = true)
> | | |-- _1: integer (nullable = false)
> | | |-- _2: integer (nullable = false){code}
>  






[jira] [Updated] (SPARK-27167) What would be the possible impact if I upgrade /static/jquery-1.11.1.min.js ?

2019-03-14 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27167:
-
Description: 
Will there be a big impact on my system if my current 
/static/jquery-1.11.1.min.js is updated to the latest version?

According to a vulnerability assessment (VA) scan, the JavaScript library we are 
currently using is vulnerable, and we want to address this. We would appreciate 
any help from the community.

Please refer below for more information:
|CVS|Severity|Description|Impact|Recommendation|Affected|Reference:|
|Vulnerable Javascript library|Medium|You are using a vulnerable Javascript 
library. One or more vulnerabilities were reported for this version of the 
Javascript library. Consult Attack details and Web References for more 
information about the affected library and the vulnerabilities that were 
reported.|Consult References for more information.|Upgrade to the latest 
version.|/static/jquery-1.11.1.min.js
  
 Details
 Detected Javascript library jquery version 1.11.1. The version was detected 
from filename.|References:
 [https://github.com/jquery/jquery/issues/2432]
 [http://blog.jquery.com/2016/01/08/jquery-2-2-and-1-12-released/]
  
 [https://snyk.io/test/npm/jquery/1.11.1]
  
 related reference not directly with spark:
 
[https://community.hortonworks.com/questions/89874/ambari-jquery-172-upgrade-to-jquery191.html]|

 

Thank you,

 

  was:
Will there be a big impact on my system if my current 
/static/jquery-1.11.1.min.js is updated to the latest version?

According to a vulnerability assessment (VA) scan, the JavaScript library we are 
currently using is vulnerable, and we want to address this. We would appreciate 
any help from the community.

Please refer below for more information:
|CVS|Severity|Description|Impact|Recommendation|Affected|Reference:|
|Vulnerable Javascript library|Medium|You are using a vulnerable Javascript 
library. One or more vulnerabilities were reported for this version of the 
Javascript library. Consult Attack details and Web References for more 
information about the affected library and the vulnerabilities that were 
reported.|Consult References for more information.|Upgrade to the latest 
version.|/static/jquery-1.11.1.min.js
 
Details
Detected Javascript library jquery version 1.11.1. The version was detected 
from filename.|References:
https://github.com/jquery/jquery/issues/2432
http://blog.jquery.com/2016/01/08/jquery-2-2-and-1-12-released/
 
https://snyk.io/test/npm/jquery/1.11.1
 
related reference not directly with spark:
https://community.hortonworks.com/questions/89874/ambari-jquery-172-upgrade-to-jquery191.html|

 

Thank you,

 


> What would be the possible impact if I upgrade /static/jquery-1.11.1.min.js ?
> -
>
> Key: SPARK-27167
> URL: https://issues.apache.org/jira/browse/SPARK-27167
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Minor
>
> Will there be a big impact on my system if my current 
> /static/jquery-1.11.1.min.js is updated to the latest version?
> According to a vulnerability assessment (VA) scan, the JavaScript library we are 
> currently using is vulnerable, and we want to address this. We would appreciate 
> any help from the community.
> Please refer below for more information:
> |CVS|Severity|Description|Impact|Recommendation|Affected|Reference:|
> |Vulnerable Javascript library|Medium|You are using a vulnerable Javascript 
> library. One or more vulnerabilities were reported for this version of the 
> Javascript library. Consult Attack details and Web References for more 
> information about the affected library and the vulnerabilities that were 
> reported.|Consult References for more information.|Upgrade to the latest 
> version.|/static/jquery-1.11.1.min.js
>   
>  Details
>  Detected Javascript library jquery version 1.11.1. The version was detected 
> from filename.|References:
>  [https://github.com/jquery/jquery/issues/2432]
>  [http://blog.jquery.com/2016/01/08/jquery-2-2-and-1-12-released/]
>   
>  [https://snyk.io/test/npm/jquery/1.11.1]
>   
>  related reference not directly with spark:
>  
> [https://community.hortonworks.com/questions/89874/ambari-jquery-172-upgrade-to-jquery191.html]|
>  
> Thank you,
>  






[jira] [Resolved] (SPARK-27136) Remove data source option check_files_exist

2019-03-14 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27136.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24069
[https://github.com/apache/spark/pull/24069]

> Remove data source option check_files_exist
> ---
>
> Key: SPARK-27136
> URL: https://issues.apache.org/jira/browse/SPARK-27136
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> The data source option check_files_exist was introduced in 
> https://github.com/apache/spark/pull/23383 when the file source V2 framework 
> was implemented. In that PR, FileIndex was created as a member of FileTable, so 
> that we could implement partition pruning like 0f9fcab in the future. At that 
> time, FileIndexes were always created for file writes, so we needed the 
> option to decide whether to check file existence.
> After https://github.com/apache/spark/pull/23774, the option is not needed 
> anymore. This PR removes the option.






[jira] [Updated] (SPARK-27167) What would be the possible impact if I upgrade /static/jquery-1.11.1.min.js ?

2019-03-14 Thread Jerry Garcia (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Garcia updated SPARK-27167:
-
Description: 
Will there be a big impact on my system if my current 
/static/jquery-1.11.1.min.js is updated to the latest version?

According to a vulnerability assessment (VA) scan, the JavaScript library we are 
currently using is vulnerable, and we want to address this. We would appreciate 
any help from the community.

Please refer below for more information:
|CVS|Severity|Description|Impact|Recommendation|Affected|Reference:|
|Vulnerable Javascript library|Medium|You are using a vulnerable Javascript 
library. One or more vulnerabilities were reported for this version of the 
Javascript library. Consult Attack details and Web References for more 
information about the affected library and the vulnerabilities that were 
reported.|Consult References for more information.|Upgrade to the latest 
version.|/static/jquery-1.11.1.min.js
 
Details
Detected Javascript library jquery version 1.11.1. The version was detected 
from filename.|References:
https://github.com/jquery/jquery/issues/2432
http://blog.jquery.com/2016/01/08/jquery-2-2-and-1-12-released/
 
https://snyk.io/test/npm/jquery/1.11.1
 
related reference not directly with spark:
https://community.hortonworks.com/questions/89874/ambari-jquery-172-upgrade-to-jquery191.html|

 

Thank you,

 

  was:
Will there be a big impact on my system if my current 
/static/jquery-1.11.1.min.js is updated to the latest version?

According to a vulnerability assessment (VA) scan, the JavaScript library we are 
currently using is vulnerable, and we want to address this. We would appreciate 
any help from the community.

 

Thanks,

 


> What would be the possible impact if I upgrade /static/jquery-1.11.1.min.js ?
> -
>
> Key: SPARK-27167
> URL: https://issues.apache.org/jira/browse/SPARK-27167
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: Jerry Garcia
>Priority: Minor
>
> Will there be a big impact on my system if my current 
> /static/jquery-1.11.1.min.js is updated to the latest version? 
> As per the VA scan, the JavaScript library we are currently using is vulnerable, 
> and we want to address this vulnerability. We appreciate any help we can get 
> from the community. 
> Please refer below for more information:
> |CVS|Severity|Description|Impact|Recommendation|Affected|Reference:|
> |Vulnerable Javascript library|Medium|You are using a vulnerable Javascript 
> library. One or more vulnerabilities were reported for this version of the 
> Javascript library. Consult Attack details and Web References for more 
> information about the affected library and the vulnerabilities that were 
> reported.|Consult References for more information.|Upgrade to the latest 
> version.|/static/jquery-1.11.1.min.js
>  
> Details
> Detected Javascript library jquery version 1.11.1. The version was detected 
> from filename.|References:
> https://github.com/jquery/jquery/issues/2432
> http://blog.jquery.com/2016/01/08/jquery-2-2-and-1-12-released/
>  
> https://snyk.io/test/npm/jquery/1.11.1
>  
> related reference not directly with spark:
> https://community.hortonworks.com/questions/89874/ambari-jquery-172-upgrade-to-jquery191.html|
>  
> Thank you,
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC

2019-03-14 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27107.
---
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 3.0.0
   2.4.2

This is resolved via [https://github.com/apache/spark/pull/24096] and 
[https://github.com/apache/spark/pull/24097] .

> Spark SQL Job failing because of Kryo buffer overflow with ORC
> --
>
> Key: SPARK-27107
> URL: https://issues.apache.org/jira/browse/SPARK-27107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dhruve Ashar
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.2, 3.0.0
>
>
> The issue occurs while trying to read ORC data and setting the SearchArgument.
> {code:java}
>  Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
> Available: 0, required: 9
> Serialization trace:
> literalList 
> (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
> leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
>   at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
>   at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
>   at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
>   at 
> org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
>   at 
> org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply

[jira] [Resolved] (SPARK-27165) Upgrade Apache ORC to 1.5.5

2019-03-14 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27165.
---
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 3.0.0
   2.4.2

This is resolved via [https://github.com/apache/spark/pull/24096] and 
[https://github.com/apache/spark/pull/24097] .

> Upgrade Apache ORC to 1.5.5
> ---
>
> Key: SPARK-27165
> URL: https://issues.apache.org/jira/browse/SPARK-27165
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.1, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.2, 3.0.0
>
>
> This issue aims to update Apache ORC dependency to fix SPARK-27107 .
> {code:java}
> [ORC-452] Support converting MAP column from JSON to ORC
> Improvement
> [ORC-447] Change the docker scripts to keep a persistent m2 cache
> [ORC-463] Add `version` command
> [ORC-475] ORC reader should lazily get filesystem
> [ORC-476] Make SearchAgument kryo buffer size configurable{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27171) Support Full-Partiton limit in the first scan

2019-03-14 Thread deshanxiao (JIRA)
deshanxiao created SPARK-27171:
--

 Summary: Support Full-Partiton limit in the first scan
 Key: SPARK-27171
 URL: https://issues.apache.org/jira/browse/SPARK-27171
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0, 2.3.2
Reporter: deshanxiao


SparkPlan#executeTake has to pick elements starting from a single partition and 
then grow the number of partitions it scans. Sometimes this is slow for certain 
queries, even though Spark is geared toward batch queries. It would be worth 
adding a switch that allows users to scan all partitions on the first pass of a 
limit.
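
To make the difference concrete, here is a minimal caller-side sketch of the two 
strategies. It is not Spark's actual executeTake code, the helper names are made 
up for illustration, and no config/switch name is assumed.
{code:java}
import org.apache.spark.rdd.RDD

import scala.collection.mutable.ArrayBuffer

// Incremental strategy, roughly what executeTake does: start with one partition
// and grow the range scanned by each successive job until n rows are collected.
def takeIncremental(rdd: RDD[Int], n: Int): Array[Int] = {
  val buf = ArrayBuffer[Int]()
  var partsScanned = 0
  val totalParts = rdd.partitions.length
  while (buf.size < n && partsScanned < totalParts) {
    val numPartsToTry = math.max(1, partsScanned * 2)
    val upTo = math.min(partsScanned + numPartsToTry, totalParts)
    val left = n - buf.size
    val res = rdd.sparkContext.runJob(
      rdd, (it: Iterator[Int]) => it.take(left).toArray, partsScanned until upTo)
    res.foreach(buf ++= _)
    partsScanned = upTo
  }
  buf.take(n).toArray
}

// Full-partition strategy, what the proposed switch would enable: one job that
// scans every partition on the first (and only) pass.
def takeAllPartitionsFirst(rdd: RDD[Int], n: Int): Array[Int] =
  rdd.sparkContext
    .runJob(rdd, (it: Iterator[Int]) => it.take(n).toArray)
    .flatten.take(n).toArray
{code}
When the first partitions are mostly filtered out, the incremental version launches 
several jobs before it finds n rows, which is the slowness described above; the full 
scan trades extra cluster work for a single pass.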



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27142) Provide REST API for SQL level information

2019-03-14 Thread Gengliang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793268#comment-16793268
 ] 

Gengliang Wang commented on SPARK-27142:


+1 on the proposal. 

> Provide REST API for SQL level information
> --
>
> Key: SPARK-27142
> URL: https://issues.apache.org/jira/browse/SPARK-27142
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ajith S
>Priority: Minor
> Attachments: image-2019-03-13-19-29-26-896.png
>
>
> Currently, when monitoring a Spark application, SQL information is not 
> available from REST but only via the UI. REST provides only applications, 
> jobs, stages, and environment. This Jira is targeted at providing a REST 
> API so that SQL-level information can be retrieved.
>  
> Details: 
> https://issues.apache.org/jira/browse/SPARK-27142?focusedCommentId=16791728&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16791728
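
For context, a sketch of how the existing endpoints are queried today; the host, 
port, and application id below are placeholders, and the SQL endpoint shown at the 
end is only the hypothetical shape of what this proposal would add.
{code:java}
import scala.io.Source

// Placeholders: the driver (or history server) host/port and the application id.
val base  = "http://localhost:4040/api/v1"
val appId = "app-20190314000000-0000"

// Endpoints that exist today:
val jobs   = Source.fromURL(s"$base/applications/$appId/jobs").mkString
val stages = Source.fromURL(s"$base/applications/$appId/stages").mkString

// Hypothetical shape of what this proposal would add for SQL executions:
// val sql = Source.fromURL(s"$base/applications/$appId/sql").mkString
{code}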



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27170) Better error message for syntax error with extraneous comma in the SQL parser

2019-03-14 Thread Wataru Yukawa (JIRA)
Wataru Yukawa created SPARK-27170:
-

 Summary: Better error message for syntax error with extraneous 
comma in the SQL parser
 Key: SPARK-27170
 URL: https://issues.apache.org/jira/browse/SPARK-27170
 Project: Spark
  Issue Type: Wish
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wataru Yukawa


[~maropu], [~smilegator]

It was great to talk with you at Hadoop / Spark Conference Japan 2019.
Thanks in advance!
I'm filing the issue we discussed at that time.

We sometimes write SQL with a syntax error caused by an extraneous comma.
For example, here is a query with an extraneous comma on line 2.

{code}
SELECT distinct
,a
,b
,c
FROM ...' LIMIT 100
{code}

Spark 2.4.0 produces an error message, but in my opinion it is a little hard to 
understand, because the line number is wrong.
{code}
cannot resolve '`distinct`' given input columns: [...]; line 1 pos 7;
'GlobalLimit 100
+- 'LocalLimit 100
+- 'Project ['distinct, ...]
+- Filter (...)
+- SubqueryAlias ...
+- HiveTableRelation ...
{code}

By the way, here is the error message from prestosql 305 for the same SQL.
The line number is correct, and I think the error message is better than Spark SQL's.
{code}
line 2:5: mismatched input ','. Expecting: '*', , 
{code}
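
For reference, the query parses fine once the stray leading comma after the first 
column is removed; a minimal runnable sketch (the column and table names a, b, c, 
some_table are placeholders, not from the original query):
{code:java}
import spark.implicits._   // spark-shell sketch; `spark` is the shell's SparkSession

// Placeholder data so the statement is runnable end to end.
Seq((1, "x", "y")).toDF("a", "b", "c").createOrReplaceTempView("some_table")

spark.sql(
  """SELECT DISTINCT
    |     a
    |    ,b
    |    ,c
    |FROM some_table LIMIT 100""".stripMargin).show()
{code}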

It would be great if the Spark SQL error message improved.

Thanks.
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26778) Implement file source V2 partitioning pruning

2019-03-14 Thread Gengliang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793263#comment-16793263
 ] 

Gengliang Wang commented on SPARK-26778:


Sorry, I meant file source partition pruning. I have updated the title.

> Implement file source V2 partitioning pruning
> -
>
> Key: SPARK-26778
> URL: https://issues.apache.org/jira/browse/SPARK-26778
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26778) Implement file source V2 partitioning pruning

2019-03-14 Thread Gengliang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-26778:
---
Summary: Implement file source V2 partitioning pruning  (was: Implement 
file source V2 partitioning )

> Implement file source V2 partitioning pruning
> -
>
> Key: SPARK-26778
> URL: https://issues.apache.org/jira/browse/SPARK-26778
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26343) Speed up running the kubernetes integration tests locally

2019-03-14 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26343.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23380
[https://github.com/apache/spark/pull/23380]

> Speed up running the kubernetes integration tests locally
> -
>
> Key: SPARK-26343
> URL: https://issues.apache.org/jira/browse/SPARK-26343
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.0.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Trivial
> Fix For: 3.0.0
>
>
> The Kubernetes integration tests right now allow you to specify a Docker tag, 
> but even when you do, they still require a tgz to extract, and then they don't 
> really need that extracted version. We could make it easier/faster for folks 
> to run the integration tests locally by not requiring a distribution tarball.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27169) number of active tasks is negative on executors page

2019-03-14 Thread acupple (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

acupple updated SPARK-27169:

Attachment: QQ20190315-102235.png

> number of active tasks is negative on executors page
> 
>
> Key: SPARK-27169
> URL: https://issues.apache.org/jira/browse/SPARK-27169
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: acupple
>Priority: Minor
> Attachments: QQ20190315-102215.png, QQ20190315-102235.png
>
>
> I use spark to process some data in hdfs and hbase, and the concurrency is 
> 16. 
> but when run some time, the active jobs will be thousands, and number of 
> active tasks are negative.
> Actually, these jobs are already done when I check driver logs
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27169) number of active tasks is negative on executors page

2019-03-14 Thread acupple (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

acupple updated SPARK-27169:

Description: 
I use Spark to process some data in HDFS and HBase. One thread consumes 
messages from a queue and then submits the work to a fixed-size thread pool 
(16 threads) that runs the Spark processing.

But after running for some time, there are thousands of active jobs, and the 
number of active tasks is negative.

Actually, these jobs are already done when I check the driver logs.

 

  was:
I use spark to process some data in hdfs and hbase, and the concurrency is 16. 

but when run some time, the active jobs will be thousands, and number of active 
tasks are negative.

Actually, these jobs are already done when I check driver logs

 


> number of active tasks is negative on executors page
> 
>
> Key: SPARK-27169
> URL: https://issues.apache.org/jira/browse/SPARK-27169
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: acupple
>Priority: Minor
> Attachments: QQ20190315-102215.png, QQ20190315-102235.png
>
>
> I use Spark to process some data in HDFS and HBase. One thread consumes 
> messages from a queue and then submits the work to a fixed-size thread pool 
> (16 threads) that runs the Spark processing.
> But after running for some time, there are thousands of active jobs, and the 
> number of active tasks is negative.
> Actually, these jobs are already done when I check the driver logs.
>  
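
A minimal sketch of the submission pattern described above, reconstructed from the 
description; the queue, the job body, and the application name are assumptions, not 
the reporter's actual code:
{code:java}
import java.util.concurrent.{Executors, LinkedBlockingQueue}

import org.apache.spark.sql.SparkSession

object ConcurrentJobsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("concurrent-jobs-sketch").getOrCreate()
    val queue = new LinkedBlockingQueue[String]()   // fed by some producer elsewhere
    val pool  = Executors.newFixedThreadPool(16)    // the "16 fix size" pool

    // Single consumer thread: take a message, hand the Spark work to the pool,
    // so up to 16 Spark jobs run concurrently inside one application.
    while (true) {
      val path = queue.take()
      pool.submit(new Runnable {
        override def run(): Unit = {
          spark.read.textFile(path).count()   // placeholder for the real processing
        }
      })
    }
  }
}
{code}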



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27169) number of active tasks is negative on executors page

2019-03-14 Thread acupple (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

acupple updated SPARK-27169:

Attachment: QQ20190315-102215.png

> number of active tasks is negative on executors page
> 
>
> Key: SPARK-27169
> URL: https://issues.apache.org/jira/browse/SPARK-27169
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: acupple
>Priority: Minor
> Attachments: QQ20190315-102215.png, QQ20190315-102235.png
>
>
> I use spark to process some data in hdfs and hbase, and the concurrency is 
> 16. 
> but when run some time, the active jobs will be thousands, and number of 
> active tasks are negative.
> Actually, these jobs are already done when I check driver logs
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27169) number of active tasks is negative on executors page

2019-03-14 Thread acupple (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

acupple updated SPARK-27169:

Description: 
I use spark to process some data in hdfs and hbase, and the concurrency is 16. 

but when run some time, the active jobs will be thousands, and number of active 
tasks are negative.

Actually, these jobs are already done when I check driver logs

 

  was:
I use spark to process some data in hdfs and hbase, and the concurrency is 16. 

but when run some time, the active jobs will be thousands, and number of active 
tasks are negative.

Actually, these jobs are already done when I check driver logs

!image-2019-03-15-10-20-36-998.png|width=576,height=242!

!image-2019-03-15-10-21-16-478.png|width=577,height=258!

 


> number of active tasks is negative on executors page
> 
>
> Key: SPARK-27169
> URL: https://issues.apache.org/jira/browse/SPARK-27169
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.2
>Reporter: acupple
>Priority: Minor
>
> I use spark to process some data in hdfs and hbase, and the concurrency is 
> 16. 
> but when run some time, the active jobs will be thousands, and number of 
> active tasks are negative.
> Actually, these jobs are already done when I check driver logs
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27169) number of active tasks is negative on executors page

2019-03-14 Thread acupple (JIRA)
acupple created SPARK-27169:
---

 Summary: number of active tasks is negative on executors page
 Key: SPARK-27169
 URL: https://issues.apache.org/jira/browse/SPARK-27169
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.2
Reporter: acupple


I use spark to process some data in hdfs and hbase, and the concurrency is 16. 

but when run some time, the active jobs will be thousands, and number of active 
tasks are negative.

Actually, these jobs are already done when I check driver logs

!image-2019-03-15-10-20-36-998.png|width=576,height=242!

!image-2019-03-15-10-21-16-478.png|width=577,height=258!

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-27141) Use ConfigEntry for hardcoded configs Yarn

2019-03-14 Thread wangjiaochun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangjiaochun reopened SPARK-27141:
--

> Use ConfigEntry for hardcoded configs Yarn
> --
>
> Key: SPARK-27141
> URL: https://issues.apache.org/jira/browse/SPARK-27141
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: wangjiaochun
>Priority: Major
> Fix For: 3.0.0
>
>
> Some of the following YARN-related files still use hardcoded config values 
> instead of ConfigEntry; try to replace them. 
> ApplicationMaster
> YarnAllocatorSuite
> ApplicationMasterSuite
> BaseYarnClusterSuite
> YarnClusterSuite



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27152) Column equality does not work for aliased columns.

2019-03-14 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793251#comment-16793251
 ] 

Hyukjin Kwon commented on SPARK-27152:
--

So, in which case is it important?

> Column equality does not work for aliased columns.
> --
>
> Key: SPARK-27152
> URL: https://issues.apache.org/jira/browse/SPARK-27152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ryan Radtke
>Priority: Minor
>
> assert($"zip".as("zip_code") equals $"zip".as("zip_code")) will return false
> assert($"zip" equals $"zip") will return true.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27164) RDD.countApprox on empty RDDs schedules jobs which never complete

2019-03-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27164:


Assignee: Apache Spark

> RDD.countApprox on empty RDDs schedules jobs which never complete 
> --
>
> Key: SPARK-27164
> URL: https://issues.apache.org/jira/browse/SPARK-27164
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.3, 2.4.0
> Environment: macOS, Spark-2.4.0 with Hadoop 2.7 running on Java 11.0.1
> Also observed on:
> macOS, Spark-2.2.3 with Hadoop 2.7 running on Java 1.8.0_151
>Reporter: Ryan Moore
>Assignee: Apache Spark
>Priority: Major
> Attachments: Screen Shot 2019-03-14 at 1.49.19 PM.png
>
>
> When calling `countApprox` on an RDD which has no partitions (such as those 
> created by `sparkContext.emptyRDD`) a job is scheduled with 0 stages and 0 
> tasks. That job appears under the "Active Jobs" in the Spark UI until it is 
> either killed or the Spark context is shut down.
>  
> {code:java}
> Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.1)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val ints = sc.makeRDD(Seq(1))
> ints: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at 
> :24
> scala> ints.countApprox(1000)
> res0: 
> org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble]
>  = (final: [1.000, 1.000])
> // PartialResult is returned, Scheduled job completed
> scala> ints.filter(_ => false).countApprox(1000)
> res1: 
> org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble]
>  = (final: [0.000, 0.000])
> // PartialResult is returned, Scheduled job completed
> scala> sc.emptyRDD[Int].countApprox(1000)
> res5: 
> org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble]
>  = (final: [0.000, 0.000])
> // PartialResult is returned, Scheduled job is ACTIVE but never completes
> scala> sc.union(Nil : Seq[org.apache.spark.rdd.RDD[Int]]).countApprox(1000)
> res16: 
> org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble]
>  = (final: [0.000, 0.000])
> // PartialResult is returned, Scheduled job is ACTIVE but never completes
> {code}
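
Until the scheduler side is fixed, a caller-side guard avoids scheduling the job at 
all; a minimal sketch, assuming only the approximate count value is needed:
{code:java}
import org.apache.spark.rdd.RDD

// Caller-side guard: skip the approximate-count job entirely when the RDD has
// zero partitions, since (per this report) that job never completes.
def countApproxSafe(rdd: RDD[_], timeoutMs: Long): Double =
  if (rdd.partitions.isEmpty) 0.0
  else rdd.countApprox(timeoutMs).initialValue.mean
{code}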



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27164) RDD.countApprox on empty RDDs schedules jobs which never complete

2019-03-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27164:


Assignee: (was: Apache Spark)

> RDD.countApprox on empty RDDs schedules jobs which never complete 
> --
>
> Key: SPARK-27164
> URL: https://issues.apache.org/jira/browse/SPARK-27164
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.3, 2.4.0
> Environment: macOS, Spark-2.4.0 with Hadoop 2.7 running on Java 11.0.1
> Also observed on:
> macOS, Spark-2.2.3 with Hadoop 2.7 running on Java 1.8.0_151
>Reporter: Ryan Moore
>Priority: Major
> Attachments: Screen Shot 2019-03-14 at 1.49.19 PM.png
>
>
> When calling `countApprox` on an RDD which has no partitions (such as those 
> created by `sparkContext.emptyRDD`) a job is scheduled with 0 stages and 0 
> tasks. That job appears under the "Active Jobs" in the Spark UI until it is 
> either killed or the Spark context is shut down.
>  
> {code:java}
> Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.1)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val ints = sc.makeRDD(Seq(1))
> ints: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at 
> :24
> scala> ints.countApprox(1000)
> res0: 
> org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble]
>  = (final: [1.000, 1.000])
> // PartialResult is returned, Scheduled job completed
> scala> ints.filter(_ => false).countApprox(1000)
> res1: 
> org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble]
>  = (final: [0.000, 0.000])
> // PartialResult is returned, Scheduled job completed
> scala> sc.emptyRDD[Int].countApprox(1000)
> res5: 
> org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble]
>  = (final: [0.000, 0.000])
> // PartialResult is returned, Scheduled job is ACTIVE but never completes
> scala> sc.union(Nil : Seq[org.apache.spark.rdd.RDD[Int]]).countApprox(1000)
> res16: 
> org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble]
>  = (final: [0.000, 0.000])
> // PartialResult is returned, Scheduled job is ACTIVE but never completes
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27168) Add docker integration test for MsSql Server

2019-03-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27168:


Assignee: Apache Spark

> Add docker integration test for MsSql Server
> 
>
> Key: SPARK-27168
> URL: https://issues.apache.org/jira/browse/SPARK-27168
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Assignee: Apache Spark
>Priority: Major
>
> Add docker integration test for MsSql Server.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27168) Add docker integration test for MsSql Server

2019-03-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27168:


Assignee: (was: Apache Spark)

> Add docker integration test for MsSql Server
> 
>
> Key: SPARK-27168
> URL: https://issues.apache.org/jira/browse/SPARK-27168
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> Add docker integration test for MsSql Server.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27168) Add docker integration test for MsSql Server

2019-03-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793234#comment-16793234
 ] 

Apache Spark commented on SPARK-27168:
--

User 'lipzhu' has created a pull request for this issue:
https://github.com/apache/spark/pull/24099

> Add docker integration test for MsSql Server
> 
>
> Key: SPARK-27168
> URL: https://issues.apache.org/jira/browse/SPARK-27168
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> Add docker integration test for MsSql Server.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27100) dag-scheduler-event-loop" java.lang.StackOverflowError

2019-03-14 Thread KaiXu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793233#comment-16793233
 ] 

KaiXu commented on SPARK-27100:
---

Hi [~hyukjin.kwon], the workload I'm running is ALS from HiBench. The code can 
be obtained from 
[here|https://github.com/intel-hadoop/HiBench/blob/master/sparkbench/ml/src/main/scala/com/intel/sparkbench/ml/ALSExample.scala],
 and here is the [doc 
|https://github.com/intel-hadoop/HiBench/blob/master/docs/run-sparkbench.md] on 
how to build and run it.

Steps to reproduce:
 # Follow the above doc to configure HiBench for your cluster.
 # Edit \{HIBENCH_HOME}/conf/benchmarks.lst and keep only ml.als in this file to 
run ALS alone.
 # Edit \{HIBENCH_HOME}/conf/hibench.conf and change the value of 
hibench.scale.profile to gigantic.
 # Edit \{HIBENCH_HOME}/conf/workloads/ml/al.conf and change the value of 
hibench.als.rank to 200 and hibench.als.numIterations to 100.
 # Run \{HIBENCH_HOME}/conf/run_all.sh to start the test.
 # Wait until about 30 iterations; it will fail with a StackOverflowError.
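
In case it is useful while debugging: assuming the overflow comes from the very long 
lineage built up over ~100 ALS iterations (which is not confirmed in this report), 
here is a minimal sketch of enabling checkpointing outside of HiBench; the checkpoint 
directory, the interval, and ratingsDF are placeholders:
{code:java}
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("als-checkpoint-sketch").getOrCreate()

// A checkpoint directory must be set for checkpointInterval to have any effect.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/als-checkpoints")

val als = new ALS()
  .setRank(200)
  .setMaxIter(100)
  .setImplicitPrefs(true)
  .setRegParam(1.0)
  .setCheckpointInterval(10)   // checkpoint intermediate RDDs every 10 iterations

// val model = als.fit(ratingsDF)   // ratingsDF: the HiBench-generated ratings data
{code}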

> dag-scheduler-event-loop" java.lang.StackOverflowError
> --
>
> Key: SPARK-27100
> URL: https://issues.apache.org/jira/browse/SPARK-27100
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.3, 2.3.3
>Reporter: KaiXu
>Priority: Major
> Attachments: stderr
>
>
> ALS in Spark MLlib causes StackOverflow:
>  /opt/sparkml/spark213/bin/spark-submit  --properties-file 
> /opt/HiBench/report/als/spark/conf/sparkbench/spark.conf --class 
> com.intel.hibench.sparkbench.ml.ALSExample --master yarn-client 
> --num-executors 3 --executor-memory 322g 
> /opt/HiBench/sparkbench/assembly/target/sparkbench-assembly-7.1-SNAPSHOT-dist.jar
>  --numUsers 4 --numProducts 6 --rank 100 --numRecommends 20 
> --numIterations 100 --kryo false --implicitPrefs true --numProductBlocks -1 
> --numUserBlocks -1 --lambda 1.0 hdfs://bdw-slave20:8020/HiBench/ALS/Input
>  
> Exception in thread "dag-scheduler-event-loop" java.lang.StackOverflowError
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1534)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>  at 
> scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
>  at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>  at 
> scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
>  at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
>  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
>  at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>  at 
> java.io.ObjectOutputStream.defaultWriteFields(Objec

[jira] [Updated] (SPARK-27168) Add docker integration test for MsSql Server

2019-03-14 Thread Zhu, Lipeng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhu, Lipeng updated SPARK-27168:

Issue Type: Test  (was: Bug)

> Add docker integration test for MsSql Server
> 
>
> Key: SPARK-27168
> URL: https://issues.apache.org/jira/browse/SPARK-27168
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> Add docker integration test for MsSql Server.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27168) Add docker integration test for MsSql Server

2019-03-14 Thread Zhu, Lipeng (JIRA)
Zhu, Lipeng created SPARK-27168:
---

 Summary: Add docker integration test for MsSql Server
 Key: SPARK-27168
 URL: https://issues.apache.org/jira/browse/SPARK-27168
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Zhu, Lipeng


Add docker integration test for MsSql Server.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27164) RDD.countApprox on empty RDDs schedules jobs which never complete

2019-03-14 Thread Ajith S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793222#comment-16793222
 ] 

Ajith S commented on SPARK-27164:
-

I will be working on this.

> RDD.countApprox on empty RDDs schedules jobs which never complete 
> --
>
> Key: SPARK-27164
> URL: https://issues.apache.org/jira/browse/SPARK-27164
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.3, 2.4.0
> Environment: macOS, Spark-2.4.0 with Hadoop 2.7 running on Java 11.0.1
> Also observed on:
> macOS, Spark-2.2.3 with Hadoop 2.7 running on Java 1.8.0_151
>Reporter: Ryan Moore
>Priority: Major
> Attachments: Screen Shot 2019-03-14 at 1.49.19 PM.png
>
>
> When calling `countApprox` on an RDD which has no partitions (such as those 
> created by `sparkContext.emptyRDD`) a job is scheduled with 0 stages and 0 
> tasks. That job appears under the "Active Jobs" in the Spark UI until it is 
> either killed or the Spark context is shut down.
>  
> {code:java}
> Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.1)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val ints = sc.makeRDD(Seq(1))
> ints: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at 
> :24
> scala> ints.countApprox(1000)
> res0: 
> org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble]
>  = (final: [1.000, 1.000])
> // PartialResult is returned, Scheduled job completed
> scala> ints.filter(_ => false).countApprox(1000)
> res1: 
> org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble]
>  = (final: [0.000, 0.000])
> // PartialResult is returned, Scheduled job completed
> scala> sc.emptyRDD[Int].countApprox(1000)
> res5: 
> org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble]
>  = (final: [0.000, 0.000])
> // PartialResult is returned, Scheduled job is ACTIVE but never completes
> scala> sc.union(Nil : Seq[org.apache.spark.rdd.RDD[Int]]).countApprox(1000)
> res16: 
> org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble]
>  = (final: [0.000, 0.000])
> // PartialResult is returned, Scheduled job is ACTIVE but never completes
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27070) DefaultPartitionCoalescer can lock up driver for hours

2019-03-14 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-27070:
-

Assignee: Yuli Fiterman

> DefaultPartitionCoalescer can lock up driver for hours
> --
>
> Key: SPARK-27070
> URL: https://issues.apache.org/jira/browse/SPARK-27070
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1, 2.3.2, 2.4.0
>Reporter: Yuli Fiterman
>Assignee: Yuli Fiterman
>Priority: Major
>
> We're running Spark on EMR, reading large datasets from S3. When trying to 
> coalesce a UnionRDD of two large FileScanRDDs (each with a few million 
> partitions) into around 8k partitions, the driver can stall for over an hour. 
>  
> The profiler shows that over 90% of the time is spent in TimSort, which is 
> invoked by `pickBin`. This seems like a very inefficient way to find the 
> least occupied PartitionGroup. IMO a better way would be to simply use the 
> `min` method on the ArrayBuffer of `PartitionGroup`s.
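
A minimal sketch of the suggested change, with names simplified (this is not the 
actual DefaultPartitionCoalescer code): a single linear scan with minBy is O(n), 
versus O(n log n) for sorting the whole buffer on every pick:
{code:java}
import scala.collection.mutable.ArrayBuffer

// Simplified stand-in for the coalescer's partition groups (not the real class).
case class Group(partitions: ArrayBuffer[Int] = ArrayBuffer.empty) {
  def size: Int = partitions.length
}

// What the sort-based pick pays for on every call: O(n log n) via TimSort.
def pickBySort(groups: ArrayBuffer[Group]): Group =
  groups.sortBy(_.size).head            // assumes groups is non-empty

// Suggested alternative: one linear scan, O(n).
def pickByMin(groups: ArrayBuffer[Group]): Group =
  groups.minBy(_.size)                  // assumes groups is non-empty
{code}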



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27070) DefaultPartitionCoalescer can lock up driver for hours

2019-03-14 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27070.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23986
[https://github.com/apache/spark/pull/23986]

> DefaultPartitionCoalescer can lock up driver for hours
> --
>
> Key: SPARK-27070
> URL: https://issues.apache.org/jira/browse/SPARK-27070
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1, 2.3.2, 2.4.0
>Reporter: Yuli Fiterman
>Assignee: Yuli Fiterman
>Priority: Major
> Fix For: 3.0.0
>
>
> We're running Spark on EMR, reading large datasets from S3. When trying to 
> coalesce a UnionRDD of two large FileScanRDDs (each with a few million 
> partitions) into around 8k partitions, the driver can stall for over an hour. 
>  
> The profiler shows that over 90% of the time is spent in TimSort, which is 
> invoked by `pickBin`. This seems like a very inefficient way to find the 
> least occupied PartitionGroup. IMO a better way would be to simply use the 
> `min` method on the ArrayBuffer of `PartitionGroup`s.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26176) Verify column name when creating table via `STORED AS`

2019-03-14 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26176:
--
Issue Type: Improvement  (was: Bug)

> Verify column name when creating table via `STORED AS`
> --
>
> Key: SPARK-26176
> URL: https://issues.apache.org/jira/browse/SPARK-26176
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> We can issue a reasonable exception when creating Parquet native tables: 
> {code:java}
> CREATE TABLE TAB1TEST USING PARQUET AS SELECT COUNT(ID) FROM TAB1;
> {code}
> {code:java}
> org.apache.spark.sql.AnalysisException: Attribute name "count(ID)" contains 
> invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
> {code}
> However, the error messages are misleading when we create a table using the 
> Hive serde "STORED AS"
> {code:java}
> CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) FROM TAB1;
> {code}
> {code:java}
> 18/11/26 09:04:44 ERROR SparkSQLDriver: Failed in [CREATE TABLE TAB2TEST 
> stored as parquet AS SELECT COUNT(col1) FROM TAB1]
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile(SaveAsHiveFile.scala:97)
>   at 
> org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile$(SaveAsHiveFile.scala:48)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.saveAsHiveFile(InsertIntoHiveTable.scala:66)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:201)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:99)
>   at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:86)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:113)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:201)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3270)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3266)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:201)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:86)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:655)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:685)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:371)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:274)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 3.0 failed 1 times, mo

[jira] [Updated] (SPARK-26176) Verify column name when creating table via `STORED AS`

2019-03-14 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26176:
--
Priority: Minor  (was: Major)

> Verify column name when creating table via `STORED AS`
> --
>
> Key: SPARK-26176
> URL: https://issues.apache.org/jira/browse/SPARK-26176
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Minor
>  Labels: starter
>
> We can issue a reasonable exception when creating Parquet native tables: 
> {code:java}
> CREATE TABLE TAB1TEST USING PARQUET AS SELECT COUNT(ID) FROM TAB1;
> {code}
> {code:java}
> org.apache.spark.sql.AnalysisException: Attribute name "count(ID)" contains 
> invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
> {code}
> However, the error messages are misleading when we create a table using the 
> Hive serde "STORED AS"
> {code:java}
> CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) FROM TAB1;
> {code}
> {code:java}
> 18/11/26 09:04:44 ERROR SparkSQLDriver: Failed in [CREATE TABLE TAB2TEST 
> stored as parquet AS SELECT COUNT(col1) FROM TAB1]
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile(SaveAsHiveFile.scala:97)
>   at 
> org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile$(SaveAsHiveFile.scala:48)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.saveAsHiveFile(InsertIntoHiveTable.scala:66)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:201)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:99)
>   at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:86)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:113)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:201)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3270)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3266)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:201)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:86)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:655)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:685)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:371)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:274)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 3.0 failed 1 times, most rec

[jira] [Assigned] (SPARK-26990) Difference in handling of mixed-case partition column names after SPARK-26188

2019-03-14 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-26990:
---

Assignee: Gengliang Wang

> Difference in handling of mixed-case partition column names after SPARK-26188
> -
>
> Key: SPARK-26990
> URL: https://issues.apache.org/jira/browse/SPARK-26990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Bruce Robbins
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> I noticed that the [PR for 
> SPARK-26188|https://github.com/apache/spark/pull/23165] changed how 
> mixed-case partition columns are handled when the user provides a schema.
> Say I have this file structure (note that each instance of `pS` is mixed 
> case):
> {noformat}
> bash-3.2$ find partitioned5 -type d
> partitioned5
> partitioned5/pi=2
> partitioned5/pi=2/pS=foo
> partitioned5/pi=2/pS=bar
> partitioned5/pi=1
> partitioned5/pi=1/pS=foo
> partitioned5/pi=1/pS=bar
> bash-3.2$
> {noformat}
> If I load the file with a user-provided schema in 2.4 (before the PR was 
> committed) or 2.3, I see:
> {noformat}
> scala> val df = spark.read.schema("intField int, pi int, ps 
> string").parquet("partitioned5")
> df: org.apache.spark.sql.DataFrame = [intField: int, pi: int ... 1 more field]
> scala> df.printSchema
> root
>  |-- intField: integer (nullable = true)
>  |-- pi: integer (nullable = true)
>  |-- ps: string (nullable = true)
> scala>
> {noformat}
> However, using 2.4 after the PR was committed, I see:
> {noformat}
> scala> val df = spark.read.schema("intField int, pi int, ps 
> string").parquet("partitioned5")
> df: org.apache.spark.sql.DataFrame = [intField: int, pi: int ... 1 more field]
> scala> df.printSchema
> root
>  |-- intField: integer (nullable = true)
>  |-- pi: integer (nullable = true)
>  |-- pS: string (nullable = true)
> scala>
> {noformat}
> Spark is picking up the mixed-case column name {{pS}} from the directory 
> name, not the lower-case {{ps}} from my specified schema.
> In all tests, {{spark.sql.caseSensitive}} is set to the default (false).
> Not sure if this is a bug, but it is a difference.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27167) What would be the possible impact if I upgrade /static/jquery-1.11.1.min.js ?

2019-03-14 Thread Jerry Garcia (JIRA)
Jerry Garcia created SPARK-27167:


 Summary: What would be the possible impact if I upgrade 
/static/jquery-1.11.1.min.js ?
 Key: SPARK-27167
 URL: https://issues.apache.org/jira/browse/SPARK-27167
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 1.6.2
Reporter: Jerry Garcia


Will there be a big impact on my system if my current 
/static/jquery-1.11.1.min.js is updated to the latest version? 

As per the VA scan, the JavaScript library we are currently using is vulnerable, 
and we want to address this vulnerability. We appreciate any help we can get 
from the community.

 

Thanks,

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27166) Improve `printSchema` to print up to the given level

2019-03-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27166:


Assignee: (was: Apache Spark)

> Improve `printSchema` to print up to the given level
> 
>
> Key: SPARK-27166
> URL: https://issues.apache.org/jira/browse/SPARK-27166
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue aims to improve `printSchema` to be able to print up to the given 
> level of the schema.
> {code:java}
> scala> val df = Seq((1,(2,(3,4.toDF
> df: org.apache.spark.sql.DataFrame = [_1: int, _2: struct<_1: int, _2: 
> struct<_1: int, _2: int>>]
> scala> df.printSchema
> root
> |-- _1: integer (nullable = false)
> |-- _2: struct (nullable = true)
> | |-- _1: integer (nullable = false)
> | |-- _2: struct (nullable = true)
> | | |-- _1: integer (nullable = false)
> | | |-- _2: integer (nullable = false)
> scala> df.printSchema(1)
> root
> |-- _1: integer (nullable = false)
> |-- _2: struct (nullable = true)
> scala> df.printSchema(2)
> root
> |-- _1: integer (nullable = false)
> |-- _2: struct (nullable = true)
> | |-- _1: integer (nullable = false)
> | |-- _2: struct (nullable = true)
> scala> df.printSchema(3)
> root
> |-- _1: integer (nullable = false)
> |-- _2: struct (nullable = true)
> | |-- _1: integer (nullable = false)
> | |-- _2: struct (nullable = true)
> | | |-- _1: integer (nullable = false)
> | | |-- _2: integer (nullable = false){code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27158) dev/mima and dev/scalastyle support dynamic profiles

2019-03-14 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27158.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Fixed in https://github.com/apache/spark/pull/24089

> dev/mima and dev/scalastyle support dynamic profiles
> 
>
> Key: SPARK-27158
> URL: https://issues.apache.org/jira/browse/SPARK-27158
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27158) dev/mima and dev/scalastyle support dynamic profiles

2019-03-14 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27158:


Assignee: Yuming Wang

> dev/mima and dev/scalastyle support dynamic profiles
> 
>
> Key: SPARK-27158
> URL: https://issues.apache.org/jira/browse/SPARK-27158
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27166) Improve `printSchema` to print up to the given level

2019-03-14 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-27166:
-

 Summary: Improve `printSchema` to print up to the given level
 Key: SPARK-27166
 URL: https://issues.apache.org/jira/browse/SPARK-27166
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


This issue aims to improve `printSchema` to be able to print up to the given 
level of the schema.
{code:java}
scala> val df = Seq((1,(2,(3,4.toDF
df: org.apache.spark.sql.DataFrame = [_1: int, _2: struct<_1: int, _2: 
struct<_1: int, _2: int>>]

scala> df.printSchema
root
|-- _1: integer (nullable = false)
|-- _2: struct (nullable = true)
| |-- _1: integer (nullable = false)
| |-- _2: struct (nullable = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: integer (nullable = false)

scala> df.printSchema(1)
root
|-- _1: integer (nullable = false)
|-- _2: struct (nullable = true)

scala> df.printSchema(2)
root
|-- _1: integer (nullable = false)
|-- _2: struct (nullable = true)
| |-- _1: integer (nullable = false)
| |-- _2: struct (nullable = true)

scala> df.printSchema(3)
root
|-- _1: integer (nullable = false)
|-- _2: struct (nullable = true)
| |-- _1: integer (nullable = false)
| |-- _2: struct (nullable = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: integer (nullable = false){code}
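
For illustration only, a rough sketch of how a depth-limited print could be approximated today (this is an assumption, not the actual patch): prune nested StructTypes past the requested depth and reuse the existing treeString printer.
{code:java}
import org.apache.spark.sql.types.{DataType, StructType}

// Illustrative helper (names assumed): drop children of structs deeper than
// `depth`, then print the pruned copy to approximate printSchema(depth).
def truncateSchema(schema: StructType, depth: Int): StructType = {
  def prune(dt: DataType, level: Int): DataType = dt match {
    case _: StructType if level >= depth => StructType(Nil)  // hide deeper children
    case st: StructType =>
      StructType(st.fields.map(f => f.copy(dataType = prune(f.dataType, level + 1))))
    case other => other
  }
  prune(schema, 0).asInstanceOf[StructType]
}

// println(truncateSchema(df.schema, 2).treeString)  // roughly what printSchema(2) would show
{code}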
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27166) Improve `printSchema` to print up to the given level

2019-03-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27166:


Assignee: Apache Spark

> Improve `printSchema` to print up to the given level
> 
>
> Key: SPARK-27166
> URL: https://issues.apache.org/jira/browse/SPARK-27166
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> This issue aims to improve `printSchema` to be able to print up to the given 
> level of the schema.
> {code:java}
> scala> val df = Seq((1,(2,(3,4.toDF
> df: org.apache.spark.sql.DataFrame = [_1: int, _2: struct<_1: int, _2: 
> struct<_1: int, _2: int>>]
> scala> df.printSchema
> root
> |-- _1: integer (nullable = false)
> |-- _2: struct (nullable = true)
> | |-- _1: integer (nullable = false)
> | |-- _2: struct (nullable = true)
> | | |-- _1: integer (nullable = false)
> | | |-- _2: integer (nullable = false)
> scala> df.printSchema(1)
> root
> |-- _1: integer (nullable = false)
> |-- _2: struct (nullable = true)
> scala> df.printSchema(2)
> root
> |-- _1: integer (nullable = false)
> |-- _2: struct (nullable = true)
> | |-- _1: integer (nullable = false)
> | |-- _2: struct (nullable = true)
> scala> df.printSchema(3)
> root
> |-- _1: integer (nullable = false)
> |-- _2: struct (nullable = true)
> | |-- _1: integer (nullable = false)
> | |-- _2: struct (nullable = true)
> | | |-- _1: integer (nullable = false)
> | | |-- _2: integer (nullable = false){code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC

2019-03-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793118#comment-16793118
 ] 

Apache Spark commented on SPARK-27107:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/24096

> Spark SQL Job failing because of Kryo buffer overflow with ORC
> --
>
> Key: SPARK-27107
> URL: https://issues.apache.org/jira/browse/SPARK-27107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dhruve Ashar
>Priority: Major
>
> The issue occurs while trying to read ORC data and setting the SearchArgument.
> {code:java}
>  Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
> Available: 0, required: 9
> Serialization trace:
> literalList 
> (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
> leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
>   at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
>   at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
>   at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
>   at 
> org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
>   at 
> org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.
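
Until the dependency upgrade lands, one possible mitigation (an assumption, not the fix adopted in this ticket) is to avoid building the SearchArgument at all by disabling ORC filter pushdown, at the cost of evaluating filters in Spark instead of inside the ORC reader.
{code:java}
// Workaround sketch (assumption): skip SearchArgument construction entirely by
// turning off ORC predicate pushdown for the affected job.
spark.conf.set("spark.sql.orc.filterPushdown", "false")

val df = spark.read.orc("/path/to/orc")      // illustrative path
df.filter(df("someColumn") > 100).count()    // filter evaluated by Spark, not pushed to ORC
{code}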

[jira] [Assigned] (SPARK-27165) Upgrade Apache ORC to 1.5.5

2019-03-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27165:


Assignee: Apache Spark

> Upgrade Apache ORC to 1.5.5
> ---
>
> Key: SPARK-27165
> URL: https://issues.apache.org/jira/browse/SPARK-27165
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.1, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> This issue aims to update Apache ORC dependency to fix SPARK-27107 .



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC

2019-03-14 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793125#comment-16793125
 ] 

Apache Spark commented on SPARK-27107:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/24097

> Spark SQL Job failing because of Kryo buffer overflow with ORC
> --
>
> Key: SPARK-27107
> URL: https://issues.apache.org/jira/browse/SPARK-27107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dhruve Ashar
>Priority: Major
>
> The issue occurs while trying to read ORC data and setting the SearchArgument.
> {code:java}
>  Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
> Available: 0, required: 9
> Serialization trace:
> literalList 
> (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
> leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
>   at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
>   at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
>   at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
>   at 
> org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
>   at 
> org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.

[jira] [Assigned] (SPARK-27165) Upgrade Apache ORC to 1.5.5

2019-03-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27165:


Assignee: (was: Apache Spark)

> Upgrade Apache ORC to 1.5.5
> ---
>
> Key: SPARK-27165
> URL: https://issues.apache.org/jira/browse/SPARK-27165
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.1, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to update Apache ORC dependency to fix SPARK-27107 .



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27165) Upgrade Apache ORC to 1.5.5

2019-03-14 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27165:
--
Description: 
This issue aims to update Apache ORC dependency to fix SPARK-27107 .
{code:java}
[ORC-452] Support converting MAP column from JSON to ORC
Improvement
[ORC-447] Change the docker scripts to keep a persistent m2 cache
[ORC-463] Add `version` command
[ORC-475] ORC reader should lazily get filesystem
[ORC-476] Make SearchAgument kryo buffer size configurable{code}

  was:This issue aims to update Apache ORC dependency to fix SPARK-27107 .


> Upgrade Apache ORC to 1.5.5
> ---
>
> Key: SPARK-27165
> URL: https://issues.apache.org/jira/browse/SPARK-27165
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.1, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to update Apache ORC dependency to fix SPARK-27107 .
> {code:java}
> [ORC-452] Support converting MAP column from JSON to ORC
> Improvement
> [ORC-447] Change the docker scripts to keep a persistent m2 cache
> [ORC-463] Add `version` command
> [ORC-475] ORC reader should lazily get filesystem
> [ORC-476] Make SearchAgument kryo buffer size configurable{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC

2019-03-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27107:


Assignee: Apache Spark

> Spark SQL Job failing because of Kryo buffer overflow with ORC
> --
>
> Key: SPARK-27107
> URL: https://issues.apache.org/jira/browse/SPARK-27107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dhruve Ashar
>Assignee: Apache Spark
>Priority: Major
>
> The issue occurs while trying to read ORC data and setting the SearchArgument.
> {code:java}
>  Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
> Available: 0, required: 9
> Serialization trace:
> literalList 
> (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
> leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
>   at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
>   at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
>   at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
>   at 
> org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
>   at 
> org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  

[jira] [Assigned] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC

2019-03-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27107:


Assignee: (was: Apache Spark)

> Spark SQL Job failing because of Kryo buffer overflow with ORC
> --
>
> Key: SPARK-27107
> URL: https://issues.apache.org/jira/browse/SPARK-27107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dhruve Ashar
>Priority: Major
>
> The issue occurs while trying to read ORC data and setting the SearchArgument.
> {code:java}
>  Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
> Available: 0, required: 9
> Serialization trace:
> literalList 
> (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
> leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
>   at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
>   at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
>   at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
>   at 
> org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
>   at 
> org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.r

[jira] [Updated] (SPARK-27165) Upgrade Apache ORC to 1.5.5

2019-03-14 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27165:
--
Description: This issue aims to update Apache ORC dependency to fix 
SPARK-27107 .  (was: This issue aims to update Apache ORC dependency to fix 
SPARK-27160.)

> Upgrade Apache ORC to 1.5.5
> ---
>
> Key: SPARK-27165
> URL: https://issues.apache.org/jira/browse/SPARK-27165
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.1, 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to update Apache ORC dependency to fix SPARK-27107 .



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27165) Upgrade Apache ORC to 1.5.5

2019-03-14 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-27165:
-

 Summary: Upgrade Apache ORC to 1.5.5
 Key: SPARK-27165
 URL: https://issues.apache.org/jira/browse/SPARK-27165
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.4.1, 3.0.0
Reporter: Dongjoon Hyun


This issue aims to update Apache ORC dependency to fix SPARK-27160.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC

2019-03-14 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793110#comment-16793110
 ] 

Dongjoon Hyun commented on SPARK-27107:
---

The vote passed. I'm preparing the PRs.

> Spark SQL Job failing because of Kryo buffer overflow with ORC
> --
>
> Key: SPARK-27107
> URL: https://issues.apache.org/jira/browse/SPARK-27107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dhruve Ashar
>Priority: Major
>
> The issue occurs while trying to read ORC data and setting the SearchArgument.
> {code:java}
>  Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
> Available: 0, required: 9
> Serialization trace:
> literalList 
> (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
> leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
>   at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
>   at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
>   at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
>   at 
> org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
>   at 
> org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.appl

[jira] [Comment Edited] (SPARK-27098) Flaky missing file parts when writing to Ceph without error

2019-03-14 Thread Martin Loncaric (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793082#comment-16793082
 ] 

Martin Loncaric edited comment on SPARK-27098 at 3/14/19 9:28 PM:
--

[~ste...@apache.org] Does this make more sense to you? This seems to suggest a 
bug in either Spark or Hadoop, but do you have a more specific idea of where to 
look?


was (Author: mwlon):
[~ste...@apache.org] Does this make more sense to you? This seems to suggest a 
bug in either Spark or Hadoop, but do you have a better idea of where to look?

> Flaky missing file parts when writing to Ceph without error
> ---
>
> Key: SPARK-27098
> URL: https://issues.apache.org/jira/browse/SPARK-27098
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Martin Loncaric
>Priority: Major
> Attachments: sanitized_stdout_1.txt
>
>
> https://stackoverflow.com/questions/54935822/spark-s3a-write-omits-upload-part-without-failure/55031233?noredirect=1#comment96835218_55031233
> Using 2.4.0 with Hadoop 2.7, hadoop-aws 2.7.5, and the Ceph S3 endpoint, 
> occasionally a file part will be missing; e.g. part 3 here:
> ```
> > aws s3 ls my-bucket/folder/
> 2019-02-28 13:07:21  0 _SUCCESS
> 2019-02-28 13:06:58   79428651 
> part-0-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:06:59   79586172 
> part-1-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:00   79561910 
> part-2-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:01   79192617 
> part-4-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:07   79364413 
> part-5-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:08   79623254 
> part-6-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:10   79445030 
> part-7-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:10   79474923 
> part-8-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:11   79477310 
> part-9-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:12   79331453 
> part-00010-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:13   79567600 
> part-00011-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:13   79388012 
> part-00012-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:14   79308387 
> part-00013-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:15   79455483 
> part-00014-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:17   79512342 
> part-00015-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:18   79403307 
> part-00016-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:18   79617769 
> part-00017-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:19   79333534 
> part-00018-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:20   79543324 
> part-00019-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> ```
> However, the write succeeds and leaves a _SUCCESS file.
> This can be caught by additionally checking afterward whether the number of 
> written file parts agrees with the number of partitions, but Spark should at 
> least fail on its own and leave a meaningful stack trace in this case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27098) Flaky missing file parts when writing to Ceph without error

2019-03-14 Thread Martin Loncaric (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793082#comment-16793082
 ] 

Martin Loncaric commented on SPARK-27098:
-

[~ste...@apache.org] Does this make more sense to you? This seems to suggest a 
bug in either Spark or Hadoop, but do you have a better idea of where to look?

> Flaky missing file parts when writing to Ceph without error
> ---
>
> Key: SPARK-27098
> URL: https://issues.apache.org/jira/browse/SPARK-27098
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Martin Loncaric
>Priority: Major
> Attachments: sanitized_stdout_1.txt
>
>
> https://stackoverflow.com/questions/54935822/spark-s3a-write-omits-upload-part-without-failure/55031233?noredirect=1#comment96835218_55031233
> Using 2.4.0 with Hadoop 2.7, hadoop-aws 2.7.5, and the Ceph S3 endpoint, 
> occasionally a file part will be missing; e.g. part 3 here:
> ```
> > aws s3 ls my-bucket/folder/
> 2019-02-28 13:07:21  0 _SUCCESS
> 2019-02-28 13:06:58   79428651 
> part-0-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:06:59   79586172 
> part-1-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:00   79561910 
> part-2-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:01   79192617 
> part-4-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:07   79364413 
> part-5-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:08   79623254 
> part-6-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:10   79445030 
> part-7-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:10   79474923 
> part-8-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:11   79477310 
> part-9-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:12   79331453 
> part-00010-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:13   79567600 
> part-00011-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:13   79388012 
> part-00012-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:14   79308387 
> part-00013-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:15   79455483 
> part-00014-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:17   79512342 
> part-00015-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:18   79403307 
> part-00016-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:18   79617769 
> part-00017-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:19   79333534 
> part-00018-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:20   79543324 
> part-00019-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> ```
> However, the write succeeds and leaves a _SUCCESS file.
> This can be caught by additionally checking afterward whether the number of 
> written file parts agrees with the number of partitions, but Spark should at 
> least fail on its own and leave a meaningful stack trace in this case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27098) Flaky missing file parts when writing to Ceph without error

2019-03-14 Thread Martin Loncaric (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793080#comment-16793080
 ] 

Martin Loncaric commented on SPARK-27098:
-

I've gotten the debug logs for (1.), but can't make much of them. In this case, 
`part-0-` was missing:

{{Exception in thread "main" java.lang.AssertionError: assertion failed: 
Expected to write dataframe with 20 partitions in s3a://my-bucket/my_folder but 
instead found 19 written parts!
  1552587026347 82681618 
part-1-5e21727b-508e-4246-b47c-c68c98c04f50-c000.snappy.parquet
  1552587027399 82631123 
part-2-5e21727b-508e-4246-b47c-c68c98c04f50-c000.snappy.parquet
  1552587028592 82513038 
part-3-5e21727b-508e-4246-b47c-c68c98c04f50-c000.snappy.parquet
  1552587029544 82325322 
part-4-5e21727b-508e-4246-b47c-c68c98c04f50-c000.snappy.parquet
  1552587030573 82497917 
part-5-5e21727b-508e-4246-b47c-c68c98c04f50-c000.snappy.parquet
  1552587031590 82736624 
part-6-5e21727b-508e-4246-b47c-c68c98c04f50-c000.snappy.parquet
  1552587032449 82573267 
part-7-5e21727b-508e-4246-b47c-c68c98c04f50-c000.snappy.parquet
  1552587033351 82590538 
part-8-5e21727b-508e-4246-b47c-c68c98c04f50-c000.snappy.parquet
  1552587034582 82617979 
part-9-5e21727b-508e-4246-b47c-c68c98c04f50-c000.snappy.parquet
  1552587035817 82430474 
part-00010-5e21727b-508e-4246-b47c-c68c98c04f50-c000.snappy.parquet
  1552587036808 82688230 
part-00011-5e21727b-508e-4246-b47c-c68c98c04f50-c000.snappy.parquet
  1552587037744 8252 
part-00012-5e21727b-508e-4246-b47c-c68c98c04f50-c000.snappy.parquet
  1552587039017 82434976 
part-00013-5e21727b-508e-4246-b47c-c68c98c04f50-c000.snappy.parquet
  1552587039919 82535772 
part-00014-5e21727b-508e-4246-b47c-c68c98c04f50-c000.snappy.parquet
  1552587040884 82612890 
part-00015-5e21727b-508e-4246-b47c-c68c98c04f50-c000.snappy.parquet
  1552587041898 82535110 
part-00016-5e21727b-508e-4246-b47c-c68c98c04f50-c000.snappy.parquet
  1552587042829 82735449 
part-00017-5e21727b-508e-4246-b47c-c68c98c04f50-c000.snappy.parquet
  1552587043744 82460648 
part-00018-5e21727b-508e-4246-b47c-c68c98c04f50-c000.snappy.parquet
  1552587044641 82658185 
part-00019-5e21727b-508e-4246-b47c-c68c98c04f50-c000.snappy.parquet
  at scala.Predef$.assert(Predef.scala:170)}}

Looking at stdout for the driver, I find that there is absolutely no mention of 
part-0, but the other parts (i.e. part-1) have various logs, including 
the "rename path" ones you mentioned, like so:

{{2019-03-14 18:10:26 DEBUG S3AFileSystem:449 - Rename path 
s3a://my-bucket/my/folder/_temporary/0/task_20190314180906_0016_m_01/part-1-5e21727b-508e-4246-b47c-c68c98c04f50-c000.snappy.parquet
 to 
s3a://my-bucket/my/folder/part-1-5e21727b-508e-4246-b47c-c68c98c04f50-c000.snappy.parquet}}

I have attached all the debugging related to part-1 here. As mentioned, 
there is nothing for the missing part-0 (in other runs, it was a different 
part missing, so there is nothing special about 0, just coincidence). 

[^sanitized_stdout_1.txt] 
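
For reference, a sketch of the post-write check described above (the helper name and structure are assumptions), comparing the part files on the target path against the expected partition count:
{code:java}
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

// Sketch of the reporter's post-write assertion (names assumed): count the
// "part-" files that actually landed and compare with the planned partitions.
def assertAllPartsWritten(spark: SparkSession, outputPath: String, expectedParts: Int): Unit = {
  val path = new Path(outputPath)
  val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
  val written = fs.listStatus(path).count(_.getPath.getName.startsWith("part-"))
  assert(written == expectedParts,
    s"Expected to write dataframe with $expectedParts partitions in $outputPath " +
      s"but instead found $written written parts!")
}
{code}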

> Flaky missing file parts when writing to Ceph without error
> ---
>
> Key: SPARK-27098
> URL: https://issues.apache.org/jira/browse/SPARK-27098
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Martin Loncaric
>Priority: Major
> Attachments: sanitized_stdout_1.txt
>
>
> https://stackoverflow.com/questions/54935822/spark-s3a-write-omits-upload-part-without-failure/55031233?noredirect=1#comment96835218_55031233
> Using 2.4.0 with Hadoop 2.7, hadoop-aws 2.7.5, and the Ceph S3 endpoint, 
> occasionally a file part will be missing; e.g. part 3 here:
> ```
> > aws s3 ls my-bucket/folder/
> 2019-02-28 13:07:21  0 _SUCCESS
> 2019-02-28 13:06:58   79428651 
> part-0-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:06:59   79586172 
> part-1-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:00   79561910 
> part-2-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:01   79192617 
> part-4-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:07   79364413 
> part-5-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:08   79623254 
> part-6-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:10   79445030 
> part-7-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:10   79474923 
> part-8-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:11   79477310 
> part-9-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:12   79331453 
> part-00010-5789

[jira] [Updated] (SPARK-27098) Flaky missing file parts when writing to Ceph without error

2019-03-14 Thread Martin Loncaric (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Loncaric updated SPARK-27098:

Attachment: sanitized_stdout_1.txt

> Flaky missing file parts when writing to Ceph without error
> ---
>
> Key: SPARK-27098
> URL: https://issues.apache.org/jira/browse/SPARK-27098
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Martin Loncaric
>Priority: Major
> Attachments: sanitized_stdout_1.txt
>
>
> https://stackoverflow.com/questions/54935822/spark-s3a-write-omits-upload-part-without-failure/55031233?noredirect=1#comment96835218_55031233
> Using 2.4.0 with Hadoop 2.7, hadoop-aws 2.7.5, and the Ceph S3 endpoint, 
> occasionally a file part will be missing; e.g. part 3 here:
> ```
> > aws s3 ls my-bucket/folder/
> 2019-02-28 13:07:21  0 _SUCCESS
> 2019-02-28 13:06:58   79428651 
> part-0-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:06:59   79586172 
> part-1-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:00   79561910 
> part-2-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:01   79192617 
> part-4-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:07   79364413 
> part-5-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:08   79623254 
> part-6-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:10   79445030 
> part-7-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:10   79474923 
> part-8-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:11   79477310 
> part-9-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:12   79331453 
> part-00010-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:13   79567600 
> part-00011-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:13   79388012 
> part-00012-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:14   79308387 
> part-00013-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:15   79455483 
> part-00014-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:17   79512342 
> part-00015-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:18   79403307 
> part-00016-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:18   79617769 
> part-00017-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:19   79333534 
> part-00018-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> 2019-02-28 13:07:20   79543324 
> part-00019-5789ebf5-b55d-4715-8bb5-dfc5c4e4b999-c000.snappy.parquet
> ```
> However, the write succeeds and leaves a _SUCCESS file.
> This can be caught by additionally checking afterward whether the number of 
> written file parts agrees with the number of partitions, but Spark should at 
> least fail on its own and leave a meaningful stack trace in this case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27164) RDD.countApprox on empty RDDs schedules jobs which never complete

2019-03-14 Thread Ryan Moore (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Moore updated SPARK-27164:
---
Attachment: Screen Shot 2019-03-14 at 1.49.19 PM.png

> RDD.countApprox on empty RDDs schedules jobs which never complete 
> --
>
> Key: SPARK-27164
> URL: https://issues.apache.org/jira/browse/SPARK-27164
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.3, 2.4.0
> Environment: macOS, Spark-2.4.0 with Hadoop 2.7 running on Java 11.0.1
> Also observed on:
> macOS, Spark-2.2.3 with Hadoop 2.7 running on Java 1.8.0_151
>Reporter: Ryan Moore
>Priority: Major
> Attachments: Screen Shot 2019-03-14 at 1.49.19 PM.png
>
>
> When calling `countApprox` on an RDD which has no partitions (such as those 
> created by `sparkContext.emptyRDD`) a job is scheduled with 0 stages and 0 
> tasks. That job appears under "Active Jobs" in the Spark UI until it is 
> either killed or the Spark context is shut down.
>  
> {code:java}
> Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.1)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val ints = sc.makeRDD(Seq(1))
> ints: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at 
> :24
> scala> ints.countApprox(1000)
> res0: 
> org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble]
>  = (final: [1.000, 1.000])
> // PartialResult is returned, Scheduled job completed
> scala> ints.filter(_ => false).countApprox(1000)
> res1: 
> org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble]
>  = (final: [0.000, 0.000])
> // PartialResult is returned, Scheduled job completed
> scala> sc.emptyRDD[Int].countApprox(1000)
> res5: 
> org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble]
>  = (final: [0.000, 0.000])
> // PartialResult is returned, Scheduled job is ACTIVE but never completes
> scala> sc.union(Nil : Seq[org.apache.spark.rdd.RDD[Int]]).countApprox(1000)
> res16: 
> org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble]
>  = (final: [0.000, 0.000])
> // PartialResult is returned, Scheduled job is ACTIVE but never completes
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27164) RDD.countApprox on empty RDDs schedules jobs which never complete

2019-03-14 Thread Ryan Moore (JIRA)
Ryan Moore created SPARK-27164:
--

 Summary: RDD.countApprox on empty RDDs schedules jobs which never 
complete 
 Key: SPARK-27164
 URL: https://issues.apache.org/jira/browse/SPARK-27164
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0, 2.2.3
 Environment: macOS, Spark-2.4.0 with Hadoop 2.7 running on Java 11.0.1

Also observed on:

macOS, Spark-2.2.3 with Hadoop 2.7 running on Java 1.8.0_151
Reporter: Ryan Moore


When calling `countApprox` on an RDD which has no partitions (such as those 
created by `sparkContext.emptyRDD`) a job is scheduled with 0 stages and 0 
tasks. That job appears under "Active Jobs" in the Spark UI until it is 
either killed or the Spark context is shut down.

 
{code:java}
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.1)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val ints = sc.makeRDD(Seq(1))
ints: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at makeRDD at 
:24

scala> ints.countApprox(1000)
res0: 
org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble] 
= (final: [1.000, 1.000])
// PartialResult is returned, Scheduled job completed

scala> ints.filter(_ => false).countApprox(1000)
res1: 
org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble] 
= (final: [0.000, 0.000])
// PartialResult is returned, Scheduled job completed

scala> sc.emptyRDD[Int].countApprox(1000)
res5: 
org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble] 
= (final: [0.000, 0.000])
// PartialResult is returned, Scheduled job is ACTIVE but never completes

scala> sc.union(Nil : Seq[org.apache.spark.rdd.RDD[Int]]).countApprox(1000)
res16: 
org.apache.spark.partial.PartialResult[org.apache.spark.partial.BoundedDouble] 
= (final: [0.000, 0.000])
// PartialResult is returned, Scheduled job is ACTIVE but never completes


{code}
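
Until the scheduler side is fixed, a defensive wrapper like the following sketch (helper name assumed) avoids submitting the zero-task job in the first place:
{code:java}
import org.apache.spark.partial.{BoundedDouble, PartialResult}
import org.apache.spark.rdd.RDD

// Guard sketch (assumption, not a fix): an RDD with zero partitions can only
// count to 0, so skip countApprox and the never-completing job it schedules.
def countApproxOrZero(rdd: RDD[_], timeoutMs: Long): Option[PartialResult[BoundedDouble]] =
  if (rdd.partitions.isEmpty) None   // sc.emptyRDD and empty unions land here
  else Some(rdd.countApprox(timeoutMs))
{code}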



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27145) Close store after test, in the SQLAppStatusListenerSuite

2019-03-14 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-27145.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24079
[https://github.com/apache/spark/pull/24079]

> Close store after test, in the SQLAppStatusListenerSuite
> 
>
> Key: SPARK-27145
> URL: https://issues.apache.org/jira/browse/SPARK-27145
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.3, 2.4.0, 3.0.0
>Reporter: shahid
>Assignee: shahid
>Priority: Minor
> Fix For: 3.0.0
>
>
> We create many stores in the SQLAppStatusListenerSuite, but we need to close 
> the store after each test.
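
A minimal loan-pattern sketch of the cleanup being asked for (the helper is illustrative, not the committed change):
{code:java}
// Loan-pattern sketch (illustrative): guarantee the store is closed even when
// the test body throws, instead of leaking it at the end of the suite.
def withStore[S <: AutoCloseable, T](create: => S)(body: S => T): T = {
  val store = create
  try body(store) finally store.close()
}
{code}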



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27145) Close store after test, in the SQLAppStatusListenerSuite

2019-03-14 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-27145:
--

Assignee: shahid

> Close store after test, in the SQLAppStatusListenerSuite
> 
>
> Key: SPARK-27145
> URL: https://issues.apache.org/jira/browse/SPARK-27145
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.3, 2.4.0, 3.0.0
>Reporter: shahid
>Assignee: shahid
>Priority: Minor
>
> We create many stores in the SQLAppStatusListenerSuite, but we need to close 
> the store after each test.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27142) Provide REST API for SQL level information

2019-03-14 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16792975#comment-16792975
 ] 

Marcelo Vanzin commented on SPARK-27142:


I'm not sure I understand your point, Sean. We expose all the data about jobs 
and streaming in the REST API; why would we not want to expose SQL?

> Provide REST API for SQL level information
> --
>
> Key: SPARK-27142
> URL: https://issues.apache.org/jira/browse/SPARK-27142
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ajith S
>Priority: Minor
> Attachments: image-2019-03-13-19-29-26-896.png
>
>
> Currently, SQL-level information for monitoring a Spark application is not 
> available from the REST API but only via the UI. The REST API exposes only 
> applications, jobs, stages, and environment. This Jira targets providing a REST 
> API so that SQL-level information can be retrieved.
>  
> Details: 
> https://issues.apache.org/jira/browse/SPARK-27142?focusedCommentId=16791728&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16791728
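
For context, the monitoring REST API is rooted at /api/v1 on the UI port, and the sketch below shows how it is queried today; the /sql path in the trailing comment is purely hypothetical, only marking where such an endpoint could live.

{code}
import scala.io.Source

// Sketch only: assumes a locally running application UI on the default port 4040.
val base = "http://localhost:4040/api/v1"

// Existing endpoints cover applications, jobs, stages, environment, ...
val appsJson = Source.fromURL(s"$base/applications").mkString
println(appsJson)

// Hypothetical endpoint this Jira asks for (name/shape not decided here):
// Source.fromURL(s"$base/applications/<app-id>/sql").mkString
{code}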






[jira] [Comment Edited] (SPARK-5997) Increase partition count without performing a shuffle

2019-03-14 Thread nirav patel (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16792964#comment-16792964
 ] 

nirav patel edited comment on SPARK-5997 at 3/14/19 6:56 PM:
-

Adding another possible use case for this ask: I am hitting an 
"IllegalArgumentException: Size exceeds Integer.MAX_VALUE" error when trying to 
write an unpartitioned DataFrame to Parquet. The error is due to a data block 
exceeding 2 GB in size before it is written to disk. The solution is to 
repartition the DataFrame (Dataset). I can do that, but I don't want to cause a 
shuffle when I increase the number of partitions with the repartition API.


was (Author: tenstriker):
Adding another possible use case for this ask - I am hitting 
IllegalArgumentException: Size exceeds Integer.MAX_VALUE error when trying to 
write unpartitioned Dataframe to parquet. Error is due to shuffleblock exceed 
2GB in size. Solution is to repartition the Dataframe (Dataset) . I can do it 
but I don't want to cause shuffle when I increase number of partitions with 
repartition API.

> Increase partition count without performing a shuffle
> -
>
> Key: SPARK-5997
> URL: https://issues.apache.org/jira/browse/SPARK-5997
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Andrew Ash
>Priority: Major
>
> When decreasing partition count with rdd.repartition() or rdd.coalesce(), the 
> user has the ability to choose whether or not to perform a shuffle.  However 
> when increasing partition count there is no option of whether to perform a 
> shuffle or not -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call 
> that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition 
> enough that the .toLocalIterator has significantly reduced memory pressure on 
> the driver, as it loads a partition at a time into the driver.
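
For readers less familiar with the current API, a short sketch of the behavior described above, using only the standard RDD methods (nothing new is assumed):

{code}
// Sketch of current behavior (assumes an active SparkContext named sc).
val rdd = sc.parallelize(1 to 1000000, numSlices = 8)

// Decreasing the partition count: the caller can opt out of a shuffle.
val narrower = rdd.coalesce(4, shuffle = false)

// Increasing the partition count: repartition always shuffles today; there is
// no rdd.repartition(largeNum, shuffle = false) equivalent, which is what this
// Jira asks for.
val wider = rdd.repartition(64)

println(s"${narrower.getNumPartitions} partitions -> ${wider.getNumPartitions} partitions")
{code}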






[jira] [Updated] (SPARK-27006) SPIP: .NET bindings for Apache Spark

2019-03-14 Thread Terry Kim (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Terry Kim updated SPARK-27006:
--
Priority: Major  (was: Minor)

> SPIP: .NET bindings for Apache Spark
> 
>
> Key: SPARK-27006
> URL: https://issues.apache.org/jira/browse/SPARK-27006
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Priority: Major
>   Original Estimate: 4,032h
>  Remaining Estimate: 4,032h
>
> h4. Background and Motivation: 
> Apache Spark provides programming language support for Scala/Java (native), 
> and extensions for Python and R. While a variety of other language extensions 
> are possible to include in Apache Spark, .NET would bring one of the largest 
> developer communities to the table. Presently, no good Big Data solution exists 
> for .NET developers in open source.  This SPIP aims at discussing how we can 
> bring Apache Spark goodness to the .NET development platform.  
> .NET is a free, cross-platform, open source developer platform for building 
> many different types of applications. With .NET, you can use multiple 
> languages, editors, and libraries to build for web, mobile, desktop, gaming, 
> and IoT types of applications. Even with .NET serving millions of developers, 
> there is no good Big Data solution that exists today, which this SPIP aims to 
> address.  
> The .NET developer community is one of the largest programming language 
> communities in the world. Its flagship programming language C# is listed as 
> one of the most popular programming languages in a variety of articles and 
> statistics: 
>  * Most popular Technologies on Stack Overflow: 
> [https://insights.stackoverflow.com/survey/2018/#most-popular-technologies|https://insights.stackoverflow.com/survey/2018/]
>   
>  * Most popular languages on GitHub 2018: 
> [https://www.businessinsider.com/the-10-most-popular-programming-languages-according-to-github-2018-10#2-java-9|https://www.businessinsider.com/the-10-most-popular-programming-languages-according-to-github-2018-10]
>  
>  * 1M+ new developers last 1 year  
>  * Second most demanded technology on LinkedIn 
>  * Top 30 High velocity OSS projects on GitHub 
> Including a C# language extension in Apache Spark will enable millions of 
> .NET developers to author Big Data applications in their preferred 
> programming language, developer environment, and tooling support. We aim to 
> promote the .NET bindings for Spark through engagements with the Spark 
> community (e.g., we are scheduled to present an early prototype at the SF 
> Spark Summit 2019) and the .NET developer community (e.g., similar 
> presentations will be held at .NET developer conferences this year).  As 
> such, we believe that our efforts will help grow the Spark community by 
> making it accessible to the millions of .NET developers. 
> Furthermore, our early discussions with some large .NET development teams got 
> an enthusiastic reception. 
> We recognize that earlier attempts at this goal (specifically Mobius 
> [https://github.com/Microsoft/Mobius]) were unsuccessful primarily due to the 
> lack of communication with the Spark community. Therefore, another goal of 
> this proposal is to not only develop .NET bindings for Spark in open source, 
> but also continuously seek feedback from the Spark community via posted 
> Jira’s (like this one) and the Spark developer mailing list. Our hope is that 
> through these engagements, we can build a community of developers that are 
> eager to contribute to this effort or want to leverage the resulting .NET 
> bindings for Spark in their respective Big Data applications. 
> h4. Target Personas: 
> .NET developers looking to build big data solutions.  
> h4. Goals: 
> Our primary goal is to help grow Apache Spark by making it accessible to the 
> large .NET developer base and ecosystem. We will also look for opportunities 
> to generalize the interop layers for Spark for adding other language 
> extensions in the future. [SPARK-26257]( 
> https://issues.apache.org/jira/browse/SPARK-26257) proposes such a 
> generalized interop layer, which we hope to address over the course of this 
> project.  
> Another important goal for us is to not only enable Spark as an application 
> solution for .NET developers, but also opening the door for .NET developers 
> to make contributions to Apache Spark itself.   
> Lastly, we aim to develop a .NET extension in the open, while continually 
> engaging with the Spark community for feedback on designs and code. We will 
> welcome PRs from the Spark community throughout this project and aim to grow 
> a community of developers that want to contribute to this project.  
> h4. Non-Goals: 
> This proposal is focused on adding .NET bindings to Apache Spark, a

[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle

2019-03-14 Thread nirav patel (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16792964#comment-16792964
 ] 

nirav patel commented on SPARK-5997:


Adding another possible use case for this ask - I am hitting 
IllegalArgumentException: Size exceeds Integer.MAX_VALUE error when trying to 
write unpartitioned Dataframe to parquet. Error is due to shuffleblock exceed 
2GB in size. Solution is to repartition the Dataframe (Dataset) . I can do it 
but I don't want to cause shuffle when I increase number of partitions with 
repartition API.

> Increase partition count without performing a shuffle
> -
>
> Key: SPARK-5997
> URL: https://issues.apache.org/jira/browse/SPARK-5997
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Andrew Ash
>Priority: Major
>
> When decreasing partition count with rdd.repartition() or rdd.coalesce(), the 
> user has the ability to choose whether or not to perform a shuffle.  However 
> when increasing partition count there is no option of whether to perform a 
> shuffle or not -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call 
> that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition 
> enough that the .toLocalIterator has significantly reduced memory pressure on 
> the driver, as it loads a partition at a time into the driver.






[jira] [Comment Edited] (SPARK-27006) SPIP: .NET bindings for Apache Spark

2019-03-14 Thread Tyson Condie (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16792939#comment-16792939
 ] 

Tyson Condie edited comment on SPARK-27006 at 3/14/19 6:23 PM:
---

I would like to briefly illuminate what I think this SPIP is trying to 
accomplish. I have worked in the Apache community for the better part of my 
career. Early on doing research at UC Berkeley related to Hadoop, then joining 
the Pig team at Yahoo! Research, and being part of the Microsoft CISL team that 
created Apache REEF, which turned out to be Microsoft’s first ever top-level 
Apache project and remains so to this day. I also had the brief pleasure of 
working with the Structured Streaming team at Databricks and witnessed first-hand 
some of the exceptional minds behind Apache Spark.

So, what is this SPIP about? In my honest opinion, it is about bringing two 
very large communities together under a common shared goal: *to democratize 
data for all developers*. Given my roots, I am a Java developer at heart, but I 
see tremendous value in the .NET stack and in its languages. Not surprisingly, 
then, I see a significant barrier to entry when telling long-time .NET 
developers that if they want to use Apache Spark, they must code in either 
Scala/Java, Python, or R. The .NET team conducted a survey (with 1000+ 
responses) revealing a strong desire from the .NET developer community to learn 
and use Spark. This SPIP is about making that process much more familiar, but 
that’s not all it’s about. 

This SPIP is about the Microsoft community wanting to learn and contribute to 
the Apache Spark community, and we are fully funded to do just that. Our 
leadership team includes Michael Rys and Rahul Potharaju from the Big Data 
organization, along with Ankit Asthana and Dan Moseley from .NET organization. 
Our development team includes Terry Kim, Steve Suh, Stephen Toub, Eric Erhardt, 
Aaron Robinson, and me, where I am again in the company of equally exceptional 
minds. Together, our goal is to develop .NET bindings for Spark in accordance 
with best practices from the Apache Foundation and Spark guidelines. We would 
welcome the opportunity to partner with leaders in the Apache Spark community, 
not only for their guidance on the work items described in this SPIP, but also 
on engagements that will bring our communities closer together and lead us to 
mutually beneficial outcomes.  

Regarding the work items in this SPIP, as recommended by earlier comments, we 
will develop externally (and openly) on a fork of Apache Spark. We only ask 
that a shepherd be available to provide us with occasional guidance towards 
getting our fork in a state that is acceptable for a contribution back to 
Apache Spark master. We recognize that such a contribution will not happen 
overnight, and that we will need to prove to the Spark community that we will 
continue to maintain it for the foreseeable future. That is why building a 
+diverse+ community is a very high priority for us, as it will ensure the 
future investments in .NET bindings for Apache Spark. All of this will take 
time. For now, we only ask if there is a Spark PMC member who is willing to 
step up and be our shepherd. 

Thank you for reading this far and we look forward to seeing you at the SF 
Spark Summit in April where we will be presenting our early progress on 
enabling .NET bindings for Apache Spark. 

 


was (Author: tcondie):
I would like to briefly illuminate what I think this SPIP is trying to 
accomplish. I have worked in the Apache community for the better part of my 
career. Early on doing research at UC Berkeley related to Hadoop, then joining 
the Pig team at Yahoo! Research, and being part of the Microsoft CISL team that 
created Apache REEF, which turned out to be Microsoft’s first ever top-level 
Apache project and remains so to this day. I also had the brief pleasure of 
working with the Structured Stream team at Databricks and witnessed first-hand 
some of the exceptional minds behind Apache Spark.

So, what is this SPIP about? In my honest opinion, it is about bringing two 
very large communities together under a common shared goal: *to democratize 
data for all developers*. Given my roots, I am a Java developer at heart, but I 
see a tremendous value in the .NET stack and in its languages. Not surprisingly 
then, I see a significant barrier of entry when telling long time .NET 
developers that if they want to use Apache Spark, they must code in either 
Scala/Java, Python, or R. The .NET team conducted a survey---with 1000+ 
responses---revealing a strong desire from the .NET developer community to 
learn and use Spark. This SPIP is about making that process much more familiar, 
but that’s not all its about. 

This SPIP is about the Microsoft community wanting to learn and contribute to 
the Apache Spark community, and we are fully funded to 

[jira] [Commented] (SPARK-27006) SPIP: .NET bindings for Apache Spark

2019-03-14 Thread Tyson Condie (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16792939#comment-16792939
 ] 

Tyson Condie commented on SPARK-27006:
--

I would like to briefly illuminate what I think this SPIP is trying to 
accomplish. I have worked in the Apache community for the better part of my 
career. Early on doing research at UC Berkeley related to Hadoop, then joining 
the Pig team at Yahoo! Research, and being part of the Microsoft CISL team that 
created Apache REEF, which turned out to be Microsoft’s first ever top-level 
Apache project and remains so to this day. I also had the brief pleasure of 
working with the Structured Stream team at Databricks and witnessed first-hand 
some of the exceptional minds behind Apache Spark.

So, what is this SPIP about? In my honest opinion, it is about bringing two 
very large communities together under a common shared goal: *to democratize 
data for all developers*. Given my roots, I am a Java developer at heart, but I 
see a tremendous value in the .NET stack and in its languages. Not surprisingly 
then, I see a significant barrier of entry when telling long time .NET 
developers that if they want to use Apache Spark, they must code in either 
Scala/Java, Python, or R. The .NET team conducted a survey---with 1000+ 
responses---revealing a strong desire from the .NET developer community to 
learn and use Spark. This SPIP is about making that process much more familiar, 
but that’s not all its about. 

This SPIP is about the Microsoft community wanting to learn and contribute to 
the Apache Spark community, and we are fully funded to do just that. Our 
leadership team includes Michael Rys and Rahul Potharaju from the Big Data 
organization, along with Ankit Asthana and Dan Moseley from .NET organization. 
Our development team includes Terry Kim, Steve Suh, Stephen Toub, Eric Erhardt, 
Aaron Robinson, and me, where I am again in the company of equally exceptional 
minds. Together, our goal is to develop .NET bindings for Spark in accordance 
to best practices from the Apache Foundation and Spark guidelines. We would 
welcome the opportunity to partner with leaders in the Apache Spark community, 
not only for their guidance on the work items described in this SPIP, but also 
on engagements that will bring our communities closer together and lead us to 
mutually beneficial outcomes.  

Regarding the work items in this SPIP, as recommended by earlier comments, we 
will develop externally (and openly) on a fork of Apache Spark. We only ask 
that a shepherd be available to provide us with occasional guidance towards 
getting our fork in a state that is acceptable for a contribution back to 
Apache Spark master. We recognize that such a contribution will not happen 
overnight, and that we will need to prove to the Spark community that we will 
continue to maintain it for the foreseeable future. That is why building a 
+diverse+ community is a very high priority for us, as it will ensure the 
future investments in .NET bindings for Apache Spark. All of this will take 
time. For now, we only ask if there is a Spark PMC member who is willing to 
step up and be our shepherd. 

Thank you for reading this far and we look forward to seeing you at the SF 
Spark Summit in April where we will be presenting our early progress on 
enabling .NET bindings for Apache Spark. 

 

> SPIP: .NET bindings for Apache Spark
> 
>
> Key: SPARK-27006
> URL: https://issues.apache.org/jira/browse/SPARK-27006
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Priority: Minor
>   Original Estimate: 4,032h
>  Remaining Estimate: 4,032h
>
> h4. Background and Motivation: 
> Apache Spark provides programming language support for Scala/Java (native), 
> and extensions for Python and R. While a variety of other language extensions 
> are possible to include in Apache Spark, .NET would bring one of the largest 
> developer communities to the table. Presently, no good Big Data solution exists 
> for .NET developers in open source.  This SPIP aims at discussing how we can 
> bring Apache Spark goodness to the .NET development platform.  
> .NET is a free, cross-platform, open source developer platform for building 
> many different types of applications. With .NET, you can use multiple 
> languages, editors, and libraries to build for web, mobile, desktop, gaming, 
> and IoT types of applications. Even with .NET serving millions of developers, 
> there is no good Big Data solution that exists today, which this SPIP aims to 
> address.  
> The .NET developer community is one of the largest programming language 
> communities in the world. Its flagship programming language C# is listed as 
> one of the most popular p

[jira] [Assigned] (SPARK-27163) Cleanup and consolidate Pandas UDF functionality

2019-03-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27163:


Assignee: (was: Apache Spark)

> Cleanup and consolidate Pandas UDF functionality
> 
>
> Key: SPARK-27163
> URL: https://issues.apache.org/jira/browse/SPARK-27163
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> Some of the code for Pandas UDFs can be cleaned up and consolidated to remove 
> duplicated parts.






[jira] [Assigned] (SPARK-27163) Cleanup and consolidate Pandas UDF functionality

2019-03-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27163:


Assignee: Apache Spark

> Cleanup and consolidate Pandas UDF functionality
> 
>
> Key: SPARK-27163
> URL: https://issues.apache.org/jira/browse/SPARK-27163
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Minor
>
> Some of the code for Pandas UDFs can be cleaned up and consolidated to remove 
> duplicated parts.






[jira] [Updated] (SPARK-27163) Cleanup and consolidate Pandas UDF functionality

2019-03-14 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-27163:
-
Priority: Minor  (was: Major)

> Cleanup and consolidate Pandas UDF functionality
> 
>
> Key: SPARK-27163
> URL: https://issues.apache.org/jira/browse/SPARK-27163
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> Some of the code for Pandas UDFs can be cleaned up and consolidated to remove 
> duplicated parts.






[jira] [Created] (SPARK-27163) Cleanup and consolidate Pandas UDF functionality

2019-03-14 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-27163:


 Summary: Cleanup and consolidate Pandas UDF functionality
 Key: SPARK-27163
 URL: https://issues.apache.org/jira/browse/SPARK-27163
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.0
Reporter: Bryan Cutler


Some of the code for Pandas UDFs can be cleaned up and consolidated to remove 
duplicated parts.






[jira] [Commented] (SPARK-26778) Implement file source V2 partitioning

2019-03-14 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16792843#comment-16792843
 ] 

Ryan Blue commented on SPARK-26778:
---

[~Gengliang.Wang], can you clarify what this issue is tracking?

> Implement file source V2 partitioning 
> --
>
> Key: SPARK-26778
> URL: https://issues.apache.org/jira/browse/SPARK-26778
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>







[jira] [Updated] (SPARK-26742) Bump Kubernetes Client Version to 4.1.2

2019-03-14 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-26742:
---
Fix Version/s: 2.4.2

> Bump Kubernetes Client Version to 4.1.2
> ---
>
> Key: SPARK-26742
> URL: https://issues.apache.org/jira/browse/SPARK-26742
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Kubernetes
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Steve Davids
>Assignee: Jiaxin Shan
>Priority: Major
>  Labels: easyfix
> Fix For: 2.4.2, 3.0.0
>
>
> Spark 2.x is using Kubernetes Client 3.x, which is quite old; the master 
> branch has 4.0. The client should be upgraded to 4.1.1 to have the broadest 
> Kubernetes compatibility support for newer clusters: 
> https://github.com/fabric8io/kubernetes-client#compatibility-matrix






[jira] [Updated] (SPARK-27158) dev/mima and dev/scalastyle support dynamic profiles

2019-03-14 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27158:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-23710

> dev/mima and dev/scalastyle support dynamic profiles
> 
>
> Key: SPARK-27158
> URL: https://issues.apache.org/jira/browse/SPARK-27158
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Created] (SPARK-27162) Add new method getOriginalMap in CaseInsensitiveStringMap

2019-03-14 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-27162:
--

 Summary: Add new method getOriginalMap in CaseInsensitiveStringMap
 Key: SPARK-27162
 URL: https://issues.apache.org/jira/browse/SPARK-27162
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


Currently, DataFrameReader/DataFrameWriter support setting Hadoop 
configurations via the `.option()` method. 
E.g.
```
class TestFileFilter extends PathFilter {
  override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
}
withTempPath { dir =>
  val path = dir.getCanonicalPath

  val df = spark.range(2)
  df.write.orc(path + "/p=1")
  df.write.orc(path + "/p=2")
  assert(spark.read.orc(path).count() === 4)

  val extraOptions = Map(
"mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
"mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
  )
  assert(spark.read.options(extraOptions).orc(path).count() === 2)
}
```
While Hadoop Configurations are case sensitive, the current data source V2 APIs 
are using `CaseInsensitiveStringMap` in TableProvider. 
To create Hadoop configurations correctly, I suggest adding a method 
`getOriginalMap` in `CaseInsensitiveStringMap`. 
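
To make the suggestion concrete, here is a simplified, stand-alone sketch of the idea; it is an illustration only, not the actual org.apache.spark.sql.util.CaseInsensitiveStringMap class, and the filter class name in the usage lines is hypothetical.

{code}
// Simplified illustration: keep the caller's original map next to the
// lower-cased view, and expose it through getOriginalMap.
class SimpleCaseInsensitiveStringMap(original: Map[String, String]) {
  private val lowerCased: Map[String, String] =
    original.map { case (k, v) => k.toLowerCase -> v }

  def get(key: String): Option[String] = lowerCased.get(key.toLowerCase)

  // Proposed accessor: return the keys exactly as the user supplied them, so
  // case-sensitive consumers such as Hadoop Configuration are not broken.
  def getOriginalMap: Map[String, String] = original
}

// Usage sketch (the filter class name is made up for the example):
val opts = new SimpleCaseInsensitiveStringMap(
  Map("mapreduce.input.pathFilter.class" -> "com.example.TestFileFilter"))
assert(opts.get("MAPREDUCE.INPUT.PATHFILTER.CLASS").isDefined)
assert(opts.getOriginalMap.contains("mapreduce.input.pathFilter.class"))
{code}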






[jira] [Updated] (SPARK-23710) Upgrade the built-in Hive to 2.3.4 for hadoop-3.1

2019-03-14 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-23710:

Issue Type: Umbrella  (was: Improvement)

> Upgrade the built-in Hive to 2.3.4 for hadoop-3.1
> -
>
> Key: SPARK-23710
> URL: https://issues.apache.org/jira/browse/SPARK-23710
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Critical
>
> Spark fails to run on Hadoop 3.x because Hive's ShimLoader considers Hadoop 
> 3.x to be an unknown Hadoop version; see SPARK-18673 and HIVE-16081 for more 
> details. So we need to upgrade the built-in Hive for Hadoop 3.x. This is an 
> umbrella JIRA to track this upgrade.
>  
> *Upgrade Plan*:
>  # SPARK-27054 Remove the Calcite dependency. This can avoid some jar 
> conflicts.
>  # SPARK-23749 Replace built-in Hive API (isSub/toKryo) and remove 
> OrcProto.Type usage
>  # SPARK-27158, SPARK-27130 Update dev/* to support dynamically changing 
> profiles when testing
>  # Fix the ORC dependency conflict so that tests pass on Hive 1.2.1 and 
> compilation passes on Hive 2.3.4
>  # Add an empty hive-thriftserverV2 module so that we can run all test cases 
> in the next step
>  # Make the Hadoop 3.1 build with Hive 2.3.4 pass the tests
>  # Adapt hive-thriftserverV2 from hive-thriftserver using Hive 2.3.4's 
> [TCLIService.thrift|https://github.com/apache/hive/blob/rel/release-2.3.4/service-rpc/if/TCLIService.thrift]
>  
> I have completed the [initial 
> work|https://github.com/apache/spark/pull/24044] and plan to finish this 
> upgrade step by step.
>   
>  






[jira] [Assigned] (SPARK-27162) Add new method getOriginalMap in CaseInsensitiveStringMap

2019-03-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27162:


Assignee: Apache Spark

> Add new method getOriginalMap in CaseInsensitiveStringMap
> -
>
> Key: SPARK-27162
> URL: https://issues.apache.org/jira/browse/SPARK-27162
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> Currently, DataFrameReader/DataFrameWriter support setting Hadoop 
> configurations via the `.option()` method. 
> E.g.
> ```
> class TestFileFilter extends PathFilter {
>   override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
> }
> withTempPath { dir =>
>   val path = dir.getCanonicalPath
>   val df = spark.range(2)
>   df.write.orc(path + "/p=1")
>   df.write.orc(path + "/p=2")
>   assert(spark.read.orc(path).count() === 4)
>   val extraOptions = Map(
> "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
> "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
>   )
>   assert(spark.read.options(extraOptions).orc(path).count() === 2)
> }
> ```
> While Hadoop Configurations are case sensitive, the current data source V2 
> APIs are using `CaseInsensitiveStringMap` in TableProvider. 
> To create Hadoop configurations correctly, I suggest adding a method 
> `getOriginalMap` in `CaseInsensitiveStringMap`. 






[jira] [Assigned] (SPARK-27162) Add new method getOriginalMap in CaseInsensitiveStringMap

2019-03-14 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27162:


Assignee: (was: Apache Spark)

> Add new method getOriginalMap in CaseInsensitiveStringMap
> -
>
> Key: SPARK-27162
> URL: https://issues.apache.org/jira/browse/SPARK-27162
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Currently, DataFrameReader/DataFrameWriter support setting Hadoop 
> configurations via the `.option()` method. 
> E.g.
> ```
> class TestFileFilter extends PathFilter {
>   override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
> }
> withTempPath { dir =>
>   val path = dir.getCanonicalPath
>   val df = spark.range(2)
>   df.write.orc(path + "/p=1")
>   df.write.orc(path + "/p=2")
>   assert(spark.read.orc(path).count() === 4)
>   val extraOptions = Map(
> "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
> "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
>   )
>   assert(spark.read.options(extraOptions).orc(path).count() === 2)
> }
> ```
> While Hadoop Configurations are case sensitive, the current data source V2 
> APIs are using `CaseInsensitiveStringMap` in TableProvider. 
> To create Hadoop configurations correctly, I suggest adding a method 
> `getOriginalMap` in `CaseInsensitiveStringMap`. 






[jira] [Updated] (SPARK-27130) Automatically select profile when executing sbt-checkstyle

2019-03-14 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27130:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-23710

> Automatically select profile when executing sbt-checkstyle
> --
>
> Key: SPARK-27130
> URL: https://issues.apache.org/jira/browse/SPARK-27130
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Minor
> Fix For: 3.0.0
>
>







[jira] [Updated] (SPARK-27054) Remove Calcite dependency

2019-03-14 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27054:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-23710

> Remove Calcite dependency
> -
>
> Key: SPARK-27054
> URL: https://issues.apache.org/jira/browse/SPARK-27054
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Calcite is only used for 
> [runSqlHive|https://github.com/apache/spark/blob/02bbe977abaf7006b845a7e99d612b0235aa0025/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L699-L705]
>  when 
> {{hive.cbo.enable=true}}([SemanticAnalyzer|https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzerFactory.java#L278-L280]).
> So we can disable {{hive.cbo.enable}} and remove Calcite dependency.






[jira] [Updated] (SPARK-23710) Upgrade the built-in Hive to 2.3.4 for hadoop-3.1

2019-03-14 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-23710:

Description: 
Spark fails to run on Hadoop 3.x because Hive's ShimLoader considers Hadoop 3.x 
to be an unknown Hadoop version; see SPARK-18673 and HIVE-16081 for more 
details. So we need to upgrade the built-in Hive for Hadoop 3.x. This is an 
umbrella JIRA to track this upgrade.

 

*Upgrade Plan*:
 # SPARK-27054 Remove the Calcite dependency. This can avoid some jar conflicts.
 # SPARK-23749 Replace built-in Hive API (isSub/toKryo) and remove 
OrcProto.Type usage
 # SPARK-27158, SPARK-27130 Update dev/* to support dynamically changing 
profiles when testing
 # Fix the ORC dependency conflict so that tests pass on Hive 1.2.1 and 
compilation passes on Hive 2.3.4
 # Add an empty hive-thriftserverV2 module so that we can run all test cases 
in the next step
 # Make the Hadoop 3.1 build with Hive 2.3.4 pass the tests
 # Adapt hive-thriftserverV2 from hive-thriftserver using Hive 2.3.4's 
[TCLIService.thrift|https://github.com/apache/hive/blob/rel/release-2.3.4/service-rpc/if/TCLIService.thrift]

 

I have completed the [initial work|https://github.com/apache/spark/pull/24044] 
and plan to finish this upgrade step by step.
  

 

  was:
Upgrade built-in Hive to 2.3.4 for Hadoop-3.1(Please note that this upgrade 
only for Hadoop-3.1).

To achieve this. We need to change sql/core, sql/hive, sql/hive-thriftserver 
modules at least:

*sql/core*: Add two source directories(sql/core/v1.2.1 and sql/core/v2.3.4) to 
distinguish the code for different built-in Hive.
 *sql/hive:* use Java reflect or shim to support Hive 1.2.1 and Hive 2.3.4 same 
time.
 *sql/hive-thriftserver:* Add new thriftserver named hive-thriftserverV2 with 
Hive 2.3.4's 
[TCLIService.thrift|https://github.com/apache/hive/blob/rel/release-2.3.4/service-rpc/if/TCLIService.thrift].

Spark fail to run on Hadoop 3.x, because Hive's shimloader considers Hadoop 3.x 
to be an unknown Hadoop version. see 
[SPARK-18673|https://issues.apache.org/jira/browse/SPARK-18673] and 
[HIVE-16081|https://issues.apache.org/jira/browse/HIVE-16081] for more details.

 

 

Upgrade Plan:
 # SPARK-27054 Remove the Calcite dependency. This can avoid some jar conflicts.
 # SPARK-23749 Replace built-in Hive API (isSub/toKryo) and remove 
OrcProto.Type usage
 # SPARK-27158, SPARK-27130 Update dev/* to support dynamic change profiles 
when testing
 # Fix ORC dependency conflict to makes it test passed on Hive 1.2.1 and 
compile passed on Hive 2.3.4
 # Add an empty hive-thriftserverV2 module. then we could test all test cases 
in next step
 # Make Hadoop-3.1 with Hive 2.3.4 test passed
 # Adapted hive-thriftserverV2 from hive-thriftserver with Hive 2.3.4's 
[TCLIService.thrift|https://github.com/apache/hive/blob/rel/release-2.3.4/service-rpc/if/TCLIService.thrift]

 

 


> Upgrade the built-in Hive to 2.3.4 for hadoop-3.1
> -
>
> Key: SPARK-23710
> URL: https://issues.apache.org/jira/browse/SPARK-23710
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Critical
>
> Spark fails to run on Hadoop 3.x because Hive's ShimLoader considers Hadoop 
> 3.x to be an unknown Hadoop version; see SPARK-18673 and HIVE-16081 for more 
> details. So we need to upgrade the built-in Hive for Hadoop 3.x. This is an 
> umbrella JIRA to track this upgrade.
>  
> *Upgrade Plan*:
>  # SPARK-27054 Remove the Calcite dependency. This can avoid some jar 
> conflicts.
>  # SPARK-23749 Replace built-in Hive API (isSub/toKryo) and remove 
> OrcProto.Type usage
>  # SPARK-27158, SPARK-27130 Update dev/* to support dynamically changing 
> profiles when testing
>  # Fix the ORC dependency conflict so that tests pass on Hive 1.2.1 and 
> compilation passes on Hive 2.3.4
>  # Add an empty hive-thriftserverV2 module so that we can run all test cases 
> in the next step
>  # Make the Hadoop 3.1 build with Hive 2.3.4 pass the tests
>  # Adapt hive-thriftserverV2 from hive-thriftserver using Hive 2.3.4's 
> [TCLIService.thrift|https://github.com/apache/hive/blob/rel/release-2.3.4/service-rpc/if/TCLIService.thrift]
>  
> I have completed the [initial 
> work|https://github.com/apache/spark/pull/24044] and plan to finish this 
> upgrade step by step.
>   
>  






[jira] [Comment Edited] (SPARK-27152) Column equality does not work for aliased columns.

2019-03-14 Thread Ryan Radtke (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16792793#comment-16792793
 ] 

Ryan Radtke edited comment on SPARK-27152 at 3/14/19 3:36 PM:
--

If you are abstracting elt, then it is important. Also, it's just sloppy. 
Probably not a major issue, though. I changed it to minor.


was (Author: ryanwradtke-thmbprnt):
If you are abstracting elt then it important.  Also, its just sloppy.

> Column equality does not work for aliased columns.
> --
>
> Key: SPARK-27152
> URL: https://issues.apache.org/jira/browse/SPARK-27152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ryan Radtke
>Priority: Minor
>
> assert($"zip".as("zip_code") equals $"zip".as("zip_code")) will return false
> assert($"zip" equals $"zip") will return true.





