[jira] [Updated] (SPARK-27848) AppVeyor change to latest R version (3.6.0)

2019-05-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27848:
-
Description: 
R 3.6.0 was released last month. We should test newer versions of R in AppVeyor.

We should set `R_REMOTES_NO_ERRORS_FROM_WARNINGS` to `true` to avoid the 
warnings below:

{code}
Error in strptime(xx, f, tz = tz) : 
  (converted from warning) unable to identify current timezone 'C':
please set environment variable 'TZ'
Error in i.p(...) : 
  (converted from warning) installation of package 'praise' had non-zero exit 
status
Calls:  ... with_rprofile_user -> with_envvar -> force -> force -> 
i.p
Execution halted
{code}

This JIRA targets the latest R version, 3.6.0.

  was:R 3.6.0 was released last month. We should test newer versions of R in
AppVeyor.


> AppVeyor change to latest R version (3.6.0)
> ---
>
> Key: SPARK-27848
> URL: https://issues.apache.org/jira/browse/SPARK-27848
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> R 3.6.0 was released last month. We should test newer versions of R in
> AppVeyor.
> We should set `R_REMOTES_NO_ERRORS_FROM_WARNINGS` to `true` to avoid the 
> warnings below:
> {code}
> Error in strptime(xx, f, tz = tz) : 
>   (converted from warning) unable to identify current timezone 'C':
> please set environment variable 'TZ'
> Error in i.p(...) : 
>   (converted from warning) installation of package 'praise' had non-zero exit 
> status
> Calls:  ... with_rprofile_user -> with_envvar -> force -> force -> 
> i.p
> Execution halted
> {code}
> This JIRA targets the latest R version, 3.6.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25944) AppVeyor change to latest R version

2019-05-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25944:
-
Description: 
R 3.5.1 was released a few months ago. We should test newer versions of R in
AppVeyor.

R 3.6.0 was released on 2019-04-26. This PR aims to change the R version in
AppVeyor from 3.5.1 to 3.6.0.

We should set `R_REMOTES_NO_ERRORS_FROM_WARNINGS` to `true` to avoid the 
warnings below:

{code}
Error in strptime(xx, f, tz = tz) : 
  (converted from warning) unable to identify current timezone 'C':
please set environment variable 'TZ'
Error in i.p(...) : 
  (converted from warning) installation of package 'praise' had non-zero exit 
status
Calls:  ... with_rprofile_user -> with_envvar -> force -> force -> 
i.p
Execution halted
{code}

This JIRA targets the latest R version, 3.6.0.

  was:
R 3.5.1 was released a few months ago. We should test newer versions of R in
AppVeyor.

R 3.6.0 was released on 2019-04-26. This PR aims to change the R version in
AppVeyor from 3.5.1 to 3.6.0.

This PR sets `R_REMOTES_NO_ERRORS_FROM_WARNINGS` to `true` to avoid the 
warnings below:

{code}
Error in strptime(xx, f, tz = tz) : 
  (converted from warning) unable to identify current timezone 'C':
please set environment variable 'TZ'
Error in i.p(...) : 
  (converted from warning) installation of package 'praise' had non-zero exit 
status
Calls:  ... with_rprofile_user -> with_envvar -> force -> force -> 
i.p
Execution halted
{code}

This JIRA targets the latest R version, 3.6.0, for now.


> AppVeyor change to latest R version
> ---
>
> Key: SPARK-25944
> URL: https://issues.apache.org/jira/browse/SPARK-25944
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> R 3.5.1 was released a few months ago. We should test newer versions of R in
> AppVeyor.
> R 3.6.0 was released on 2019-04-26. This PR aims to change the R version in
> AppVeyor from 3.5.1 to 3.6.0.
> We should set `R_REMOTES_NO_ERRORS_FROM_WARNINGS` to `true` to avoid the 
> warnings below:
> {code}
> Error in strptime(xx, f, tz = tz) : 
>   (converted from warning) unable to identify current timezone 'C':
> please set environment variable 'TZ'
> Error in i.p(...) : 
>   (converted from warning) installation of package 'praise' had non-zero exit 
> status
> Calls:  ... with_rprofile_user -> with_envvar -> force -> force -> 
> i.p
> Execution halted
> {code}
> This JIRA targets the latest R version, 3.6.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25944) AppVeyor change to latest R version

2019-05-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25944:
-
Description: 
R 3.5.1 was released a few months ago. We should test newer versions of R in
AppVeyor.

R 3.6.0 was released on 2019-04-26. This PR aims to change the R version in
AppVeyor from 3.5.1 to 3.6.0.

This PR sets `R_REMOTES_NO_ERRORS_FROM_WARNINGS` to `true` to avoid the 
warnings below:

{code}
Error in strptime(xx, f, tz = tz) : 
  (converted from warning) unable to identify current timezone 'C':
please set environment variable 'TZ'
Error in i.p(...) : 
  (converted from warning) installation of package 'praise' had non-zero exit 
status
Calls:  ... with_rprofile_user -> with_envvar -> force -> force -> 
i.p
Execution halted
{code}

This JIRA targets the latest R version, 3.6.0, for now.

  was:
R 3.5.1 was released a few months ago. We should test newer versions of R in
AppVeyor.

R 3.6.0 was released on 2019-04-26. This PR aims to change the R version in
AppVeyor from 3.5.1 to 3.6.0.

This PR sets `R_REMOTES_NO_ERRORS_FROM_WARNINGS` to `true` to avoid the 
warnings below:

```
Error in strptime(xx, f, tz = tz) : 
  (converted from warning) unable to identify current timezone 'C':
please set environment variable 'TZ'
Error in i.p(...) : 
  (converted from warning) installation of package 'praise' had non-zero exit 
status
Calls:  ... with_rprofile_user -> with_envvar -> force -> force -> 
i.p
Execution halted
```

This JIRA targets the latest R version, 3.6.0, for now.


> AppVeyor change to latest R version
> ---
>
> Key: SPARK-25944
> URL: https://issues.apache.org/jira/browse/SPARK-25944
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> R 3.5.1 was released a few months ago. We should test newer versions of R in
> AppVeyor.
> R 3.6.0 was released on 2019-04-26. This PR aims to change the R version in
> AppVeyor from 3.5.1 to 3.6.0.
> This PR sets `R_REMOTES_NO_ERRORS_FROM_WARNINGS` to `true` to avoid the 
> warnings below:
> {code}
> Error in strptime(xx, f, tz = tz) : 
>   (converted from warning) unable to identify current timezone 'C':
> please set environment variable 'TZ'
> Error in i.p(...) : 
>   (converted from warning) installation of package 'praise' had non-zero exit 
> status
> Calls:  ... with_rprofile_user -> with_envvar -> force -> force -> 
> i.p
> Execution halted
> {code}
> This JIRA targets the latest R version, 3.6.0, for now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25944) AppVeyor change to latest R version

2019-05-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25944:
-
Description: 
R 3.5.1 was released a few months ago. We should test newer versions of R in
AppVeyor.

R 3.6.0 was released on 2019-04-26. This PR aims to change the R version in
AppVeyor from 3.5.1 to 3.6.0.

This PR sets `R_REMOTES_NO_ERRORS_FROM_WARNINGS` to `true` to avoid the 
warnings below:

```
Error in strptime(xx, f, tz = tz) : 
  (converted from warning) unable to identify current timezone 'C':
please set environment variable 'TZ'
Error in i.p(...) : 
  (converted from warning) installation of package 'praise' had non-zero exit 
status
Calls:  ... with_rprofile_user -> with_envvar -> force -> force -> 
i.p
Execution halted
```

This JIRA targets the latest R version, 3.6.0, for now.

  was:R 3.5.1 was released a few months ago. We should test newer versions of R
in AppVeyor.


> AppVeyor change to latest R version
> ---
>
> Key: SPARK-25944
> URL: https://issues.apache.org/jira/browse/SPARK-25944
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> R 3.5.1 was released a few months ago. We should test newer versions of R in
> AppVeyor.
> R 3.6.0 was released on 2019-04-26. This PR aims to change the R version in
> AppVeyor from 3.5.1 to 3.6.0.
> This PR sets `R_REMOTES_NO_ERRORS_FROM_WARNINGS` to `true` to avoid the 
> warnings below:
> ```
> Error in strptime(xx, f, tz = tz) : 
>   (converted from warning) unable to identify current timezone 'C':
> please set environment variable 'TZ'
> Error in i.p(...) : 
>   (converted from warning) installation of package 'praise' had non-zero exit 
> status
> Calls:  ... with_rprofile_user -> with_envvar -> force -> force -> 
> i.p
> Execution halted
> ```
> This JIRA targets the latest R version, 3.6.0, for now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25944) AppVeyor change to latest R version

2019-05-27 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849354#comment-16849354
 ] 

Hyukjin Kwon commented on SPARK-25944:
--

Ah .. I realised that I messed up JIRA and PR links. I fixed it accordingly.

> AppVeyor change to latest R version
> ---
>
> Key: SPARK-25944
> URL: https://issues.apache.org/jira/browse/SPARK-25944
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> R 3.5.1 was released a few months ago. We should test newer versions of R in
> AppVeyor.
> R 3.6.0 was released on 2019-04-26. This PR aims to change the R version in
> AppVeyor from 3.5.1 to 3.6.0.
> This PR sets `R_REMOTES_NO_ERRORS_FROM_WARNINGS` to `true` to avoid the 
> warnings below:
> ```
> Error in strptime(xx, f, tz = tz) : 
>   (converted from warning) unable to identify current timezone 'C':
> please set environment variable 'TZ'
> Error in i.p(...) : 
>   (converted from warning) installation of package 'praise' had non-zero exit 
> status
> Calls:  ... with_rprofile_user -> with_envvar -> force -> force -> 
> i.p
> Execution halted
> ```
> This JIRA targets the latest R version, 3.6.0, for now.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-27848) AppVeyor change to latest R version (3.6.0)

2019-05-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27848:
-
Comment: was deleted

(was: User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/24716)

> AppVeyor change to latest R version (3.6.0)
> ---
>
> Key: SPARK-27848
> URL: https://issues.apache.org/jira/browse/SPARK-27848
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> R 3.6.0 was released last month. We should test newer versions of R in
> AppVeyor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-27848) AppVeyor change to latest R version (3.6.0)

2019-05-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27848:
-
Comment: was deleted

(was: Fixed in https://github.com/apache/spark/pull/24716)

> AppVeyor change to latest R version (3.6.0)
> ---
>
> Key: SPARK-27848
> URL: https://issues.apache.org/jira/browse/SPARK-27848
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> R 3.6.0 was released last month. We should test newer versions of R in
> AppVeyor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27848) AppVeyor change to latest R version (3.6.0)

2019-05-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27848.
--
Resolution: Duplicate

> AppVeyor change to latest R version (3.6.0)
> ---
>
> Key: SPARK-27848
> URL: https://issues.apache.org/jira/browse/SPARK-27848
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> R 3.6.0 was released last month. We should test newer versions of R in
> AppVeyor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25944) AppVeyor change to latest R version

2019-05-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25944:
-
Summary: AppVeyor change to latest R version  (was: AppVeyor change to 
latest R version (3.5.1))

> AppVeyor change to latest R version
> ---
>
> Key: SPARK-25944
> URL: https://issues.apache.org/jira/browse/SPARK-25944
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> R 3.5.1 was released a few months ago. We should test newer versions of R in
> AppVeyor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-27848) AppVeyor change to latest R version (3.6.0)

2019-05-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-27848:
--

> AppVeyor change to latest R version (3.6.0)
> ---
>
> Key: SPARK-27848
> URL: https://issues.apache.org/jira/browse/SPARK-27848
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> R 3.6.0 was released last month. We should test newer versions of R in
> AppVeyor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27848) AppVeyor change to latest R version (3.6.0)

2019-05-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849353#comment-16849353
 ] 

Apache Spark commented on SPARK-27848:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/24716

> AppVeyor change to latest R version (3.6.0)
> ---
>
> Key: SPARK-27848
> URL: https://issues.apache.org/jira/browse/SPARK-27848
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> R 3.6.0 was released last month. We should test newer versions of R in
> AppVeyor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27848) AppVeyor change to latest R version (3.6.0)

2019-05-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27848.
--
Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/24716

> AppVeyor change to latest R version (3.6.0)
> ---
>
> Key: SPARK-27848
> URL: https://issues.apache.org/jira/browse/SPARK-27848
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> R 3.6.0 was released last month. We should test newer versions of R in
> AppVeyor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27848) AppVeyor change to latest R version (3.6.0)

2019-05-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27848:
-
Fix Version/s: 3.0.0

> AppVeyor change to latest R version (3.6.0)
> ---
>
> Key: SPARK-27848
> URL: https://issues.apache.org/jira/browse/SPARK-27848
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> R 3.6.0 was released last month. We should test newer versions of R in
> AppVeyor.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-27860) Use efficient sorting instead of `.sorted.reverse` sequence

2019-05-27 Thread wenxuanguan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wenxuanguan closed SPARK-27860.
---

> Use efficient sorting instead of `.sorted.reverse` sequence
> ---
>
> Key: SPARK-27860
> URL: https://issues.apache.org/jira/browse/SPARK-27860
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, Structured Streaming, Web UI
>Affects Versions: 2.4.3
>Reporter: wenxuanguan
>Priority: Minor
>
> Use a descending sort instead of the two-step sequence of an ascending sort
> followed by a reverse.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27860) Use efficient sorting instead of `.sorted.reverse` sequence

2019-05-27 Thread wenxuanguan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wenxuanguan resolved SPARK-27860.
-
Resolution: Duplicate

> Use efficient sorting instead of `.sorted.reverse` sequence
> ---
>
> Key: SPARK-27860
> URL: https://issues.apache.org/jira/browse/SPARK-27860
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, Structured Streaming, Web UI
>Affects Versions: 2.4.3
>Reporter: wenxuanguan
>Priority: Minor
>
> Use a descending sort instead of the two-step sequence of an ascending sort
> followed by a reverse.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-341) Added MapPartitionsWithSplitRDD.

2019-05-27 Thread wenxuanguan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wenxuanguan closed SPARK-341.
-

> Added MapPartitionsWithSplitRDD.
> 
>
> Key: SPARK-341
> URL: https://issues.apache.org/jira/browse/SPARK-341
> Project: Spark
>  Issue Type: Bug
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27860) Use efficient sorting instead of `.sorted.reverse` sequence

2019-05-27 Thread wenxuanguan (JIRA)
wenxuanguan created SPARK-27860:
---

 Summary: Use efficient sorting instead of `.sorted.reverse` 
sequence
 Key: SPARK-27860
 URL: https://issues.apache.org/jira/browse/SPARK-27860
 Project: Spark
  Issue Type: Improvement
  Components: DStreams, Structured Streaming, Web UI
Affects Versions: 2.4.3
Reporter: wenxuanguan


Use a descending sort instead of the two-step sequence of an ascending sort
followed by a reverse.
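
For illustration, a minimal Scala sketch of the difference being described
(hypothetical values, not taken from the actual change):

{code:scala}
// Two passes over the data: ascending sort, then a full reverse.
val durations = Seq(3L, 1L, 2L)
val twoStep = durations.sorted.reverse

// One pass: sort directly with a descending ordering.
val oneStep = durations.sorted(Ordering[Long].reverse)

assert(twoStep == oneStep)  // both yield Seq(3L, 2L, 1L)
{code}

Both forms give the same result; the second avoids materializing and then
reversing the intermediate ascending sequence.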




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27859) Use efficient sorting instead of `.sorted.reverse` sequence

2019-05-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27859.
---
   Resolution: Fixed
 Assignee: wenxuanguan
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/24711

> Use efficient sorting instead of `.sorted.reverse` sequence
> ---
>
> Key: SPARK-27859
> URL: https://issues.apache.org/jira/browse/SPARK-27859
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: wenxuanguan
>Priority: Trivial
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27859) Use efficient sorting instead of `.sorted.reverse` sequence

2019-05-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27859:


Assignee: Apache Spark

> Use efficient sorting instead of `.sorted.reverse` sequence
> ---
>
> Key: SPARK-27859
> URL: https://issues.apache.org/jira/browse/SPARK-27859
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27859) Use efficient sorting instead of `.sorted.reverse` sequence

2019-05-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27859:


Assignee: (was: Apache Spark)

> Use efficient sorting instead of `.sorted.reverse` sequence
> ---
>
> Key: SPARK-27859
> URL: https://issues.apache.org/jira/browse/SPARK-27859
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27859) Use efficient sorting instead of `.sorted.reverse` sequence

2019-05-27 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-27859:
-

 Summary: Use efficient sorting instead of `.sorted.reverse` 
sequence
 Key: SPARK-27859
 URL: https://issues.apache.org/jira/browse/SPARK-27859
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27833) java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark

2019-05-27 Thread Raviteja (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849320#comment-16849320
 ] 

Raviteja edited comment on SPARK-27833 at 5/28/19 4:25 AM:
---

I have tried this step by step; I hit the issue when I add a watermark along
with groupBy and agg functions. Without watermarking, the job completes
successfully. A similar issue has been raised in
[SPARK-27564|https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-27564].
We cannot use the console sink because it only prints the output to the
console; we are using a custom sink because we need to write the data to other
databases. I don't know much about Scala, but I think the error occurs because
the query planner cannot find a plan for the watermark node when creating the
physical plan.


was (Author: ravitejasutrave):
I have tried this step by step; I hit the issue when I add a watermark along
with groupBy and agg functions. Without watermarking, the job completes
successfully.
We cannot use the console sink because it only prints the output to the
console; we are using a custom sink because we need to write the data to other
databases. I don't know much about Scala, but I think the error occurs because
the query planner cannot find a plan for the watermark node when creating the
physical plan.

>  java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark 
> 
>
> Key: SPARK-27833
> URL: https://issues.apache.org/jira/browse/SPARK-27833
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
> Environment: spark 2.3.0
> java 1.8
> kafka version 0.10.
>Reporter: Raviteja
>Priority: Minor
>  Labels: spark-streaming-kafka
> Attachments: kafka_consumer_code.java, kafka_custom_sink.java, 
> kafka_error_log.txt
>
>
> Hi,
> We have a requirement to read data from Kafka, apply some transformations, and
> store the data in a database. For this we are implementing the watermarking
> feature along with an aggregate function, and for storing we are writing our
> own sink (Structured Streaming). We are using Spark 2.3.0, Java 1.8 and Kafka
> version 0.10.
>  We are getting the below error:
> "*java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark
> timestamp#39: timestamp, interval 2 minutes*"
>  
> The job works perfectly fine when we use the console sink instead of the
> custom sink. For debugging, our custom sink only performs "dataframe.show()"
> and nothing else.
> Please find attached the error log and the code. Please look into this issue,
> as it is a blocker and we are not able to proceed further or find any
> alternatives, since we need the watermarking feature.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27833) java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark

2019-05-27 Thread Raviteja (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849320#comment-16849320
 ] 

Raviteja commented on SPARK-27833:
--

I have tried this step by step; I hit the issue when I add a watermark along
with groupBy and agg functions. Without watermarking, the job completes
successfully.
We cannot use the console sink because it only prints the output to the
console; we are using a custom sink because we need to write the data to other
databases. I don't know much about Scala, but I think the error occurs because
the query planner cannot find a plan for the watermark node when creating the
physical plan.
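
For context, a minimal Scala sketch of the shape of such a query; the rate
source and console sink below stand in for the Kafka source and custom sink
from the report:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object WatermarkAggSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("watermark-agg").getOrCreate()
    import spark.implicits._

    // Built-in rate source stands in for the Kafka source used in the report.
    val events = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

    val counts = events
      .withWatermark("timestamp", "2 minutes")       // event-time watermark
      .groupBy(window($"timestamp", "1 minute"))     // windowed aggregation
      .count()

    counts.writeStream
      .outputMode("append")
      .format("console")   // the report replaces this with a custom Sink
      .start()
      .awaitTermination()
  }
}
{code}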

>  java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark 
> 
>
> Key: SPARK-27833
> URL: https://issues.apache.org/jira/browse/SPARK-27833
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
> Environment: spark 2.3.0
> java 1.8
> kafka version 0.10.
>Reporter: Raviteja
>Priority: Minor
>  Labels: spark-streaming-kafka
> Attachments: kafka_consumer_code.java, kafka_custom_sink.java, 
> kafka_error_log.txt
>
>
> Hi,
> We have a requirement to read data from Kafka, apply some transformations, and
> store the data in a database. For this we are implementing the watermarking
> feature along with an aggregate function, and for storing we are writing our
> own sink (Structured Streaming). We are using Spark 2.3.0, Java 1.8 and Kafka
> version 0.10.
>  We are getting the below error:
> "*java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark
> timestamp#39: timestamp, interval 2 minutes*"
>  
> The job works perfectly fine when we use the console sink instead of the
> custom sink. For debugging, our custom sink only performs "dataframe.show()"
> and nothing else.
> Please find attached the error log and the code. Please look into this issue,
> as it is a blocker and we are not able to proceed further or find any
> alternatives, since we need the watermarking feature.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23191) Worker registration fails in case of network drop

2019-05-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23191:
---

Assignee: wuyi

> Worker registration fails in case of network drop
> ---
>
> Key: SPARK-23191
> URL: https://issues.apache.org/jira/browse/SPARK-23191
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.1, 2.3.0
> Environment: OS:- Centos 6.9(64 bit)
>  
>Reporter: Neeraj Gupta
>Assignee: wuyi
>Priority: Critical
>
> We have a 3-node cluster. We were facing issues with multiple drivers running
> in some scenarios in production.
> On further investigation we were able to reproduce the scenario in both 1.6.3
> and 2.2.1 with the following steps:
>  # Set up a 3-node cluster. Start the master and the slaves.
>  # On any node where the worker process is running, block connections on
> port 7077 using iptables.
> {code:java}
> iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
>  # After about 10-15 seconds the node reports an error that it is unable to
> connect to the master.
> {code:java}
> 2018-01-23 12:08:51,639 [rpc-client-1-1] WARN  
> org.apache.spark.network.server.TransportChannelHandler - Exception in 
> connection from 
> java.io.IOException: Connection timed out
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>     at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
>     at 
> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
>     at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
>     at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
>     at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>     at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>     at java.lang.Thread.run(Thread.java:745)
> 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR 
> org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting 
> for master to reconnect...
> 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR 
> org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting 
> for master to reconnect...
> {code}
>  # Once we get this exception, we re-enable the connections to port 7077 using
> {code:java}
> iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
>  # The worker tries to register again with the master but is unable to do so.
> It gives the following error:
> {code:java}
> 2018-01-23 12:08:58,657 [worker-register-master-threadpool-2] WARN  
> org.apache.spark.deploy.worker.Worker - Failed to connect to master 
> :7077
> org.apache.spark.SparkException: Exception thrown in awaitResult:
>     at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
>     at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>     at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
>     at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
>     at 
> org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:241)
>     at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to :7077
>     at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
>     at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
>     at 
> org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
>     at 

[jira] [Resolved] (SPARK-23191) Worker registration fails in case of network drop

2019-05-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23191.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24569
[https://github.com/apache/spark/pull/24569]

> Worker registration fails in case of network drop
> ---
>
> Key: SPARK-23191
> URL: https://issues.apache.org/jira/browse/SPARK-23191
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.1, 2.3.0
> Environment: OS:- Centos 6.9(64 bit)
>  
>Reporter: Neeraj Gupta
>Assignee: wuyi
>Priority: Critical
> Fix For: 3.0.0
>
>
> We have a 3-node cluster. We were facing issues with multiple drivers running
> in some scenarios in production.
> On further investigation we were able to reproduce the scenario in both 1.6.3
> and 2.2.1 with the following steps:
>  # Set up a 3-node cluster. Start the master and the slaves.
>  # On any node where the worker process is running, block connections on
> port 7077 using iptables.
> {code:java}
> iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
>  # After about 10-15 seconds the node reports an error that it is unable to
> connect to the master.
> {code:java}
> 2018-01-23 12:08:51,639 [rpc-client-1-1] WARN  
> org.apache.spark.network.server.TransportChannelHandler - Exception in 
> connection from 
> java.io.IOException: Connection timed out
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>     at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
>     at 
> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
>     at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
>     at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
>     at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>     at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>     at java.lang.Thread.run(Thread.java:745)
> 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR 
> org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting 
> for master to reconnect...
> 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR 
> org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting 
> for master to reconnect...
> {code}
>  # Once we get this exception, we re-enable the connections to port 7077 using
> {code:java}
> iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
>  # The worker tries to register again with the master but is unable to do so.
> It gives the following error:
> {code:java}
> 2018-01-23 12:08:58,657 [worker-register-master-threadpool-2] WARN  
> org.apache.spark.deploy.worker.Worker - Failed to connect to master 
> :7077
> org.apache.spark.SparkException: Exception thrown in awaitResult:
>     at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
>     at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>     at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
>     at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
>     at 
> org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:241)
>     at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to :7077
>     at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
>     at 
> 

[jira] [Updated] (SPARK-27578) Support INTERVAL ... HOUR TO SECOND syntax

2019-05-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27578:
--
Summary: Support INTERVAL ... HOUR TO SECOND syntax  (was: Add support for 
"interval '23:59:59' hour to second")

> Support INTERVAL ... HOUR TO SECOND syntax
> --
>
> Key: SPARK-27578
> URL: https://issues.apache.org/jira/browse/SPARK-27578
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> Currently, Spark SQL supports an interval format like this:
>  
> {code:java}
> select interval '5 23:59:59.155' day to second.{code}
>  
> Can Spark SQL support the grammar below? Presto and Teradata already support
> it well.
> {code:java}
> select interval '23:59:59.155' hour to second
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27578) Add support for "interval '23:59:59' hour to second"

2019-05-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27578:
--
Issue Type: Improvement  (was: Bug)

> Add support for "interval '23:59:59' hour to second"
> 
>
> Key: SPARK-27578
> URL: https://issues.apache.org/jira/browse/SPARK-27578
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> Currently, Spark SQL supports an interval format like this:
>  
> {code:java}
> select interval '5 23:59:59.155' day to second.{code}
>  
> Can Spark SQL support the grammar below? Teradata already supports it well.
> {code:java}
> select interval '23:59:59.155' hour to second
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27578) Add support for "interval '23:59:59' hour to second"

2019-05-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27578:
--
Description: 
Currently, Spark SQL supports an interval format like this:

 
{code:java}
select interval '5 23:59:59.155' day to second.{code}
 

Can Spark SQL support the grammar below? Presto and Teradata already support it
well.
{code:java}
select interval '23:59:59.155' hour to second
{code}
 

 

  was:
Currently, Spark SQL supports an interval format like this:

 
{code:java}
select interval '5 23:59:59.155' day to second.{code}
 

Can Spark SQL support the grammar below? Teradata already supports it well.
{code:java}
select interval '23:59:59.155' hour to second
{code}
 

 


> Add support for "interval '23:59:59' hour to second"
> 
>
> Key: SPARK-27578
> URL: https://issues.apache.org/jira/browse/SPARK-27578
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> Currently, Spark SQL supports an interval format like this:
>  
> {code:java}
> select interval '5 23:59:59.155' day to second.{code}
>  
> Can Spark SQL support the grammar below? Presto and Teradata already support
> it well.
> {code:java}
> select interval '23:59:59.155' hour to second
> {code}
>  
>  
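
For reference, a spark-shell style sketch of the two forms being compared (the
second is the proposed syntax, not accepted at the time of this issue):

{code:scala}
// Already supported: DAY TO SECOND qualifier.
spark.sql("SELECT interval '5 23:59:59.155' day to second").show()

// Requested: HOUR TO SECOND qualifier, as in Presto/Teradata.
spark.sql("SELECT interval '23:59:59.155' hour to second").show()
{code}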



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27858) Fix for avro deserialization on union types with multiple non-null types

2019-05-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27858.
---
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.4

This is resolved via https://github.com/apache/spark/pull/24722

> Fix for avro deserialization on union types with multiple non-null types
> 
>
> Key: SPARK-27858
> URL: https://issues.apache.org/jira/browse/SPARK-27858
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.4, 3.0.0
>
>
> This issue aims to fix a bug where, for a union Avro type with more than one
> non-null type (for instance `["string", "null", "int"]`), deserialization to a
> DataFrame would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue
> was that the `fieldWriter` relied on the index from the Avro schema before the
> nulls were filtered out.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27858) Fix for avro deserialization on union types with multiple non-null types

2019-05-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27858:


Assignee: (was: Apache Spark)

> Fix for avro deserialization on union types with multiple non-null types
> 
>
> Key: SPARK-27858
> URL: https://issues.apache.org/jira/browse/SPARK-27858
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to fix a bug where, for a union Avro type with more than one
> non-null type (for instance `["string", "null", "int"]`), deserialization to a
> DataFrame would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue
> was that the `fieldWriter` relied on the index from the Avro schema before the
> nulls were filtered out.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27858) Fix for avro deserialization on union types with multiple non-null types

2019-05-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849257#comment-16849257
 ] 

Apache Spark commented on SPARK-27858:
--

User 'gcmerz' has created a pull request for this issue:
https://github.com/apache/spark/pull/24722

> Fix for avro deserialization on union types with multiple non-null types
> 
>
> Key: SPARK-27858
> URL: https://issues.apache.org/jira/browse/SPARK-27858
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to fix a bug where, for a union Avro type with more than one
> non-null type (for instance `["string", "null", "int"]`), deserialization to a
> DataFrame would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue
> was that the `fieldWriter` relied on the index from the Avro schema before the
> nulls were filtered out.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27858) Fix for avro deserialization on union types with multiple non-null types

2019-05-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27858:


Assignee: Apache Spark

> Fix for avro deserialization on union types with multiple non-null types
> 
>
> Key: SPARK-27858
> URL: https://issues.apache.org/jira/browse/SPARK-27858
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> This issue aims to fix a bug where, for a union Avro type with more than one
> non-null type (for instance `["string", "null", "int"]`), deserialization to a
> DataFrame would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue
> was that the `fieldWriter` relied on the index from the Avro schema before the
> nulls were filtered out.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27858) Fix for avro deserialization on union types with multiple non-null types

2019-05-27 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-27858:
-

 Summary: Fix for avro deserialization on union types with multiple 
non-null types
 Key: SPARK-27858
 URL: https://issues.apache.org/jira/browse/SPARK-27858
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3, 3.0.0
Reporter: Dongjoon Hyun


This issue aims to fix a bug where, for a union Avro type with more than one
non-null type (for instance `["string", "null", "int"]`), deserialization to a
DataFrame would throw a `java.lang.ArrayIndexOutOfBoundsException`. The issue was
that the `fieldWriter` relied on the index from the Avro schema before the nulls
were filtered out.
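
For illustration, a small Scala sketch (using a hypothetical record schema) of
such a union and of its non-null branches, which, per the description above, is
what the field writers need to be indexed against:

{code:scala}
import org.apache.avro.Schema
import scala.collection.JavaConverters._

// Hypothetical record with a union field that has two non-null branches.
val schema = new Schema.Parser().parse(
  """{"type": "record", "name": "Rec", "fields": [
    |  {"name": "f", "type": ["string", "null", "int"]}
    |]}""".stripMargin)

val branches = schema.getField("f").schema().getTypes.asScala
val nonNull  = branches.filterNot(_.getType == Schema.Type.NULL)

println(branches.map(_.getType))  // Buffer(STRING, NULL, INT)
println(nonNull.map(_.getType))   // Buffer(STRING, INT) -- indexes differ from the raw union
{code}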




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27857) DataSourceV2: Support ALTER TABLE statements

2019-05-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27857:


Assignee: Apache Spark

> DataSourceV2: Support ALTER TABLE statements
> 
>
> Key: SPARK-27857
> URL: https://issues.apache.org/jira/browse/SPARK-27857
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ryan Blue
>Assignee: Apache Spark
>Priority: Major
>
> ALTER TABLE statements should be supported for v2 tables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27857) DataSourceV2: Support ALTER TABLE statements

2019-05-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27857:


Assignee: (was: Apache Spark)

> DataSourceV2: Support ALTER TABLE statements
> 
>
> Key: SPARK-27857
> URL: https://issues.apache.org/jira/browse/SPARK-27857
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Ryan Blue
>Priority: Major
>
> ALTER TABLE statements should be supported for v2 tables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27857) DataSourceV2: Support ALTER TABLE statements

2019-05-27 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-27857:
-

 Summary: DataSourceV2: Support ALTER TABLE statements
 Key: SPARK-27857
 URL: https://issues.apache.org/jira/browse/SPARK-27857
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.3
Reporter: Ryan Blue


ALTER TABLE statements should be supported for v2 tables.
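
For illustration, the kinds of statements meant here, issued through spark.sql
(the catalog and table names below are placeholders; the exact set of supported
clauses is defined by the work under this ticket):

{code:scala}
// Hypothetical v2 table identified by a catalog-qualified name.
spark.sql("ALTER TABLE testcat.db.tbl ADD COLUMN points INT")
spark.sql("ALTER TABLE testcat.db.tbl RENAME COLUMN points TO score")
spark.sql("ALTER TABLE testcat.db.tbl DROP COLUMN score")
spark.sql("ALTER TABLE testcat.db.tbl SET TBLPROPERTIES ('note' = 'example')")
{code}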



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27768) Infinity, -Infinity, NaN should be recognized in a case insensitive manner

2019-05-27 Thread Dilip Biswal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849078#comment-16849078
 ] 

Dilip Biswal commented on SPARK-27768:
--

[~dongjoon]
 Thanks for trying out Presto.

Just want to share my 2 cents before we take a final call on it. I am okay with
whatever you guys decide :).
 There seems to be a subtle difference between Presto and Spark: Spark returns
NULL in this case, whereas Presto returns an error. Because of this I think we
should be more accommodating of data that is accepted by other systems. I am
afraid that, because of the null-returning semantics, sometimes during the ETL
process we will treat valid input from other systems as nulls, and it is
probably hard for users to locate the bad records and fix them.

Let's say for a second that we decide to accept this case. Technically we will
not be portable with Hive and Presto, but we are allowing something more than
those two systems, right? Do we think that some users would actually want
strings such as "infinity" to be treated as null and would be negatively
surprised to see the new behaviour?
 Let me know what you think.
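
For context, a spark-shell style sketch of the behaviour being discussed (my
reading of the report above; hypothetical, not output from an actual run):

{code:scala}
spark.sql("SELECT CAST('Infinity' AS DOUBLE)").show()   // recognized (exact case)
spark.sql("SELECT CAST('infinity' AS DOUBLE)").show()   // NULL before the fix
{code}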

> Infinity, -Infinity, NaN should be recognized in a case insensitive manner
> --
>
> Key: SPARK-27768
> URL: https://issues.apache.org/jira/browse/SPARK-27768
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> When the inputs contain the constant 'infinity', Spark SQL does not generate 
> the expected results.
> {code:java}
> SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
> FROM (VALUES ('1'), (CAST('infinity' AS DOUBLE))) v(x);
> SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
> FROM (VALUES ('infinity'), ('1')) v(x);
> SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
> FROM (VALUES ('infinity'), ('infinity')) v(x);
> SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
> FROM (VALUES ('-infinity'), ('infinity')) v(x);{code}
>  The root cause: Spark SQL does not recognize the special constants in a case 
> insensitive way. In PostgreSQL, they are recognized in a case insensitive 
> way. 
> Link: https://www.postgresql.org/docs/9.3/datatype-numeric.html 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27071) Expose additional metrics in status.api.v1.StageData

2019-05-27 Thread Herman van Hovell (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-27071.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

> Expose additional metrics in status.api.v1.StageData
> 
>
> Key: SPARK-27071
> URL: https://issues.apache.org/jira/browse/SPARK-27071
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Tom van Bussel
>Assignee: Tom van Bussel
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently StageData exposes the following metrics:
>  * executorRunTime
>  * executorCpuTime
>  * inputBytes
>  * inputRecords
>  * outputBytes
>  * outputRecords
>  * shuffleReadBytes
>  * shuffleReadRecords
>  * shuffleWriteBytes
>  * shuffleWriteRecords
>  * memoryBytesSpilled
>  * diskBytesSpilled
> These metrics are computed by aggregating the metrics of the tasks in the
> stage. For tasks, however, we keep track of many more metrics. These extra
> metrics are currently also computed for stages (such as shuffle read fetch wait
> time), but they are not exposed through the API. It would be very useful if
> these were also exposed.
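
For reference, a small Scala sketch of where StageData surfaces (the host, port
and application id below are placeholders for a running application):

{code:scala}
import scala.io.Source

// The stages endpoint of the monitoring REST API returns StageData records.
val url  = "http://localhost:4040/api/v1/applications/app-20190527-0001/stages"
val json = Source.fromURL(url).mkString
println(json)  // per-stage aggregates; the proposal is to expose the extra metrics here as well
{code}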



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27071) Expose additional metrics in status.api.v1.StageData

2019-05-27 Thread Herman van Hovell (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell reassigned SPARK-27071:
-

Assignee: Tom van Bussel

> Expose additional metrics in status.api.v1.StageData
> 
>
> Key: SPARK-27071
> URL: https://issues.apache.org/jira/browse/SPARK-27071
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Tom van Bussel
>Assignee: Tom van Bussel
>Priority: Major
>
> Currently StageData exposes the following metrics:
>  * executorRunTime
>  * executorCpuTime
>  * inputBytes
>  * inputRecords
>  * outputBytes
>  * outputRecords
>  * shuffleReadBytes
>  * shuffleReadRecords
>  * shuffleWriteBytes
>  * shuffleWriteRecords
>  * memoryBytesSpilled
>  * diskBytesSpilled
> These metrics are computed by aggregating the metrics of the tasks in the 
> stage. For the task metrics however we keep track of a lot more metrics. 
> Currently these metrics are also computed for stages (such shuffle read fetch 
> wait time), but these are not exposed through the api. It would be very 
> useful if these were also exposed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27777) Eliminate uncessary sliding job in AreaUnderCurve

2019-05-27 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-27777:
-

Assignee: zhengruifeng

> Eliminate uncessary sliding job in AreaUnderCurve
> -
>
> Key: SPARK-27777
> URL: https://issues.apache.org/jira/browse/SPARK-27777
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> The current implementation of \{AreaUnderCurve} uses \{SlidingRDD} to perform the 
> sliding, which requires an extra prepended job to compute the head items of each 
> partition. However, this job can be eliminated in the AUC computation by 
> collecting the local areas and the head/last points of each partition at once.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27777) Eliminate uncessary sliding job in AreaUnderCurve

2019-05-27 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27777.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24648
[https://github.com/apache/spark/pull/24648]

> Eliminate uncessary sliding job in AreaUnderCurve
> -
>
> Key: SPARK-27777
> URL: https://issues.apache.org/jira/browse/SPARK-27777
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> The current implementation of \{AreaUnderCurve} uses \{SlidingRDD} to perform the 
> sliding, which requires an extra prepended job to compute the head items of each 
> partition. However, this job can be eliminated in the AUC computation by 
> collecting the local areas and the head/last points of each partition at once.
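
For reference, a rough single-pass sketch of that idea (an illustrative sketch over plain RDDs, not the merged MLlib change): each partition reports its first and last points together with its local trapezoid area, and the driver adds the trapezoids that span partition boundaries, so no separate prepended job is needed.

{code:java}
import org.apache.spark.rdd.RDD

// Area of the trapezoid between two consecutive points of the curve.
def trapezoid(p1: (Double, Double), p2: (Double, Double)): Double =
  (p2._1 - p1._1) * (p2._2 + p1._2) / 2.0

// Assumes the curve RDD is already sorted by x within and across partitions.
def areaUnderCurve(curve: RDD[(Double, Double)]): Double = {
  // One pass: each non-empty partition emits (index, head point, last point, local area).
  val summaries = curve.mapPartitionsWithIndex { (idx, iter) =>
    if (iter.isEmpty) Iterator.empty
    else {
      val points = iter.toArray
      val localArea = points.sliding(2).collect {
        case Array(a, b) => trapezoid(a, b)
      }.sum
      Iterator.single((idx, points.head, points.last, localArea))
    }
  }.collect().sortBy(_._1)

  val localTotal = summaries.map(_._4).sum
  // Add the trapezoids that span partition boundaries.
  val boundaryTotal = summaries.sliding(2).collect {
    case Array(prev, next) => trapezoid(prev._3, next._2)
  }.sum
  localTotal + boundaryTotal
}
{code}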



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27855) Union failed between 2 datasets of the same type converted from different dataframes

2019-05-27 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849016#comment-16849016
 ] 

Liang-Chi Hsieh commented on SPARK-27855:
-

If you look at the printed schemas, the two Datasets differ: the columns are in a 
different order. Dataset.union resolves columns by position, which is well 
documented in the API doc.

If you want to resolve columns by name, please use the Dataset.unionByName API.
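
For example, a minimal sketch (reusing the Entity case class and data from the report below, in a spark-shell session so that the implicits are in scope):

{code:java}
case class Entity(key: Int, a: Int, b: String)

val ds1 = Seq((2, 2, "2")).toDF("key", "a", "b").as[Entity]
val ds2 = Seq((1, "1", 1)).toDF("key", "b", "a").as[Entity]

// union resolves columns by position, so the differing column order of the
// underlying dataframes triggers the up-cast error reported below;
// unionByName matches columns by name and does not depend on the order.
ds1.unionByName(ds2).show()
{code}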

> Union failed between 2 datasets of the same type converted from different 
> dataframes
> 
>
> Key: SPARK-27855
> URL: https://issues.apache.org/jira/browse/SPARK-27855
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Hao Ren
>Priority: Major
>
> Two Datasets of the same type converted from different dataframes cannot be 
> unioned.
> Here is the code to reproduce the problem. It seems `union` just checks the 
> schema of the original dataframe, even if the two datasets have already been 
> converted to the same Dataset type.
> {code:java}
> case class Entity(key: Int, a: Int, b: String)
> val df1 = Seq((2,2,"2")).toDF("key", "a", "b").as[Entity]
> val df2 = Seq((1,"1",1)).toDF("key", "b", "a").as[Entity]
> df1.printSchema
> df2.printSchema
> df1 union df2
> {code}
> Result
> {code:java}
> defined class Entity
> df1: org.apache.spark.sql.Dataset[Entity] = [key: int, a: int ... 1 more 
> field]
> df2: org.apache.spark.sql.Dataset[Entity] = [key: int, b: string ... 1 more 
> field]
> converted
> root
> |-- key: integer (nullable = false)
> |-- a: integer (nullable = false)
> |-- b: string (nullable = true)
> root
> |-- key: integer (nullable = false)
> |-- b: string (nullable = true)
> |-- a: integer (nullable = false)
> org.apache.spark.sql.AnalysisException: Cannot up cast `a` from string to int 
> as it may truncate
> The type path of the target object is:
> - field (class: "scala.Int", name: "a")
> - root class: "Entity"{code}
> The problem is that the two datasets of the same type have different schemas.
> The schema of the dataset does not preserve the order of the fields in the 
> case class definition, but follows that of the original dataframe.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27665) Split fetch shuffle blocks protocol from OpenBlocks

2019-05-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27665:
---

Assignee: Yuanjian Li

> Split fetch shuffle blocks protocol from OpenBlocks
> ---
>
> Key: SPARK-27665
> URL: https://issues.apache.org/jira/browse/SPARK-27665
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>
> In the current approach in OneForOneBlockFetcher, we reuse the OpenBlocks 
> protocol to describe the fetch request for shuffle blocks, which makes the 
> extension work for shuffle fetching, like SPARK-9853 and SPARK-25341, very 
> awkward. We need a new protocol dedicated to fetching shuffle blocks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27665) Split fetch shuffle blocks protocol from OpenBlocks

2019-05-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27665.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24565
[https://github.com/apache/spark/pull/24565]

> Split fetch shuffle blocks protocol from OpenBlocks
> ---
>
> Key: SPARK-27665
> URL: https://issues.apache.org/jira/browse/SPARK-27665
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>
> In the current approach in OneForOneBlockFetcher, we reuse the OpenBlocks 
> protocol to describe the fetch request for shuffle blocks, which makes the 
> extension work for shuffle fetching, like SPARK-9853 and SPARK-25341, very 
> awkward. We need a new protocol dedicated to fetching shuffle blocks.
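
To make the contrast concrete, here is an illustrative sketch (not the merged protocol; the field names are only indicative) of what a dedicated shuffle-fetch message could carry, compared to the generic OpenBlocks message that only ships opaque block-id strings:

{code:java}
// Generic message reused today: every shuffle block travels as a string block id.
case class OpenBlocks(appId: String, execId: String, blockIds: Array[String])

// A dedicated message can describe the request structurally, which is easier to
// extend for shuffle-fetch features such as SPARK-9853 / SPARK-25341.
case class FetchShuffleBlocks(
    appId: String,
    execId: String,
    shuffleId: Int,
    mapIds: Array[Int],
    reduceIds: Array[Array[Int]]) // reduce ids requested per map output
{code}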



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27776) Avoid duplicate Java reflection in DataSource

2019-05-27 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-27776:
--
Priority: Trivial  (was: Minor)

> Avoid duplicate Java reflection in DataSource
> -
>
> Key: SPARK-27776
> URL: https://issues.apache.org/jira/browse/SPARK-27776
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Trivial
>
> I checked the code of 
> {code:java}
> org.apache.spark.sql.execution.datasources.DataSource{code}
> and there is duplicate Java reflection in it.
> I want to avoid it.
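
As a generic illustration of the kind of change (a hypothetical sketch, not the actual DataSource code; the class and method names are made up), a reflective lookup can be resolved once and reused instead of being repeated at every call site:

{code:java}
// Hypothetical sketch: cache the reflective class lookup in a lazy val so the
// same provider name is not resolved repeatedly.
class ProviderResolver(className: String) {
  private lazy val providingClass: Class[_] =
    Thread.currentThread().getContextClassLoader.loadClass(className)

  def newInstance(): Any = providingClass.getConstructor().newInstance()
}
{code}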



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27856) do not forcibly add cast when inserting table

2019-05-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27856:


Assignee: Apache Spark  (was: Wenchen Fan)

> do not forcibly add cast when inserting table
> -
>
> Key: SPARK-27856
> URL: https://issues.apache.org/jira/browse/SPARK-27856
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27856) do not forcibly add cast when inserting table

2019-05-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27856:


Assignee: Wenchen Fan  (was: Apache Spark)

> do not forcibly add cast when inserting table
> -
>
> Key: SPARK-27856
> URL: https://issues.apache.org/jira/browse/SPARK-27856
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27856) do not forcibly add cast when inserting table

2019-05-27 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-27856:
---

 Summary: do not forcibly add cast when inserting table
 Key: SPARK-27856
 URL: https://issues.apache.org/jira/browse/SPARK-27856
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27855) Union failed between 2 datasets of the same type converted from different dataframes

2019-05-27 Thread Hao Ren (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hao Ren updated SPARK-27855:

Description: 
Two Datasets of the same type converted from different dataframes cannot be unioned.

Here is the code to reproduce the problem. It seems `union` just checks the 
schema of the original dataframe, even if the two datasets have already been 
converted to the same Dataset type.
{code:java}
case class Entity(key: Int, a: Int, b: String)
val df1 = Seq((2,2,"2")).toDF("key", "a", "b").as[Entity]
val df2 = Seq((1,"1",1)).toDF("key", "b", "a").as[Entity]
df1.printSchema
df2.printSchema
df1 union df2
{code}
Result
{code:java}
defined class Entity
df1: org.apache.spark.sql.Dataset[Entity] = [key: int, a: int ... 1 more field]
df2: org.apache.spark.sql.Dataset[Entity] = [key: int, b: string ... 1 more 
field]
converted
root
|-- key: integer (nullable = false)
|-- a: integer (nullable = false)
|-- b: string (nullable = true)

root
|-- key: integer (nullable = false)
|-- b: string (nullable = true)
|-- a: integer (nullable = false)

org.apache.spark.sql.AnalysisException: Cannot up cast `a` from string to int 
as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "a")
- root class: "Entity"{code}
The problem is that the two datasets of the same type have different schemas.

The schema of the dataset does not preserve the order of the fields in the case 
class definition, but follows that of the original dataframe.

  was:
Two Datasets of the same type converted from different dataframes cannot be unioned.

Here is the code to reproduce the problem. It seems `union` just checks the 
schema of the original dataframe, even if the two datasets have already been 
converted to the same Dataset type.
{code:java}
case class Entity(key: Int, a: Int, b: String)
val df1 = Seq((2,2,"2")).toDF("key", "a", "b").as[Entity]
val df2 = Seq((1,"1",1)).toDF("key", "b", "a").as[Entity]
df1.printSchema
df2.printSchema
df1 union df2
{code}
Result
{code:java}
defined class Entity
df1: org.apache.spark.sql.Dataset[Entity] = [key: int, a: int ... 1 more field]
df2: org.apache.spark.sql.Dataset[Entity] = [key: int, b: string ... 1 more 
field]
converted
root
|-- key: integer (nullable = false)
|-- a: integer (nullable = false)
|-- b: string (nullable = true)

root
|-- key: integer (nullable = false)
|-- b: string (nullable = true)
|-- a: integer (nullable = false)

org.apache.spark.sql.AnalysisException: Cannot up cast `a` from string to int 
as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "a")
- root class: "Entity"{code}


> Union failed between 2 datasets of the same type converted from different 
> dataframes
> 
>
> Key: SPARK-27855
> URL: https://issues.apache.org/jira/browse/SPARK-27855
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Hao Ren
>Priority: Major
>
> Two Datasets of the same type converted from different dataframes cannot be 
> unioned.
> Here is the code to reproduce the problem. It seems `union` just checks the 
> schema of the original dataframe, even if the two datasets have already been 
> converted to the same Dataset type.
> {code:java}
> case class Entity(key: Int, a: Int, b: String)
> val df1 = Seq((2,2,"2")).toDF("key", "a", "b").as[Entity]
> val df2 = Seq((1,"1",1)).toDF("key", "b", "a").as[Entity]
> df1.printSchema
> df2.printSchema
> df1 union df2
> {code}
> Result
> {code:java}
> defined class Entity
> df1: org.apache.spark.sql.Dataset[Entity] = [key: int, a: int ... 1 more 
> field]
> df2: org.apache.spark.sql.Dataset[Entity] = [key: int, b: string ... 1 more 
> field]
> converted
> root
> |-- key: integer (nullable = false)
> |-- a: integer (nullable = false)
> |-- b: string (nullable = true)
> root
> |-- key: integer (nullable = false)
> |-- b: string (nullable = true)
> |-- a: integer (nullable = false)
> org.apache.spark.sql.AnalysisException: Cannot up cast `a` from string to int 
> as it may truncate
> The type path of the target object is:
> - field (class: "scala.Int", name: "a")
> - root class: "Entity"{code}
> The problem is that the two datasets of the same type have different schemas.
> The schema of the dataset does not preserve the order of the fields in the 
> case class definition, but follows that of the original dataframe.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27855) Union failed between 2 datasets of the same type converted from different dataframes

2019-05-27 Thread Hao Ren (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hao Ren updated SPARK-27855:

Description: 
Two Datasets of the same type converted from different dataframes cannot be unioned.

Here is the code to reproduce the problem. It seems `union` just checks the 
schema of the original dataframe, even if the two datasets have already been 
converted to the same Dataset type.
{code:java}
case class Entity(key: Int, a: Int, b: String)
val df1 = Seq((2,2,"2")).toDF("key", "a", "b").as[Entity]
val df2 = Seq((1,"1",1)).toDF("key", "b", "a").as[Entity]
df1.printSchema
df2.printSchema
df1 union df2
{code}
Result
{code:java}
defined class Entity
df1: org.apache.spark.sql.Dataset[Entity] = [key: int, a: int ... 1 more field]
df2: org.apache.spark.sql.Dataset[Entity] = [key: int, b: string ... 1 more 
field]
converted
root
|-- key: integer (nullable = false)
|-- a: integer (nullable = false)
|-- b: string (nullable = true)

root
|-- key: integer (nullable = false)
|-- b: string (nullable = true)
|-- a: integer (nullable = false)

org.apache.spark.sql.AnalysisException: Cannot up cast `a` from string to int 
as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "a")
- root class: "Entity"{code}

  was:
Two Datasets of the same type converted from different dataframes cannot be unioned.

Here is the code to reproduce the problem. It seems `union` just checks the 
schema of the original dataframe, even if the two datasets have already been 
converted to the same Dataset type.
{code:java}
case class Entity(key: Int, a: Int, b: String)
val df1 = Seq((2,2,"2")).toDF("key", "a", "b").as[Entity]
val df2 = Seq((1,"1",1)).toDF("key", "b", "a").as[Entity]
df1.printSchema
df2.printSchema
df1 union df2
{code}
Result
{code:java}
defined class Entity df1: org.apache.spark.sql.Dataset[Entity] = [key: int, a: 
int ... 1 more field] df2: org.apache.spark.sql.Dataset[Entity] = [key: int, b: 
string ... 1 more field] converted root |-- key: integer (nullable = false) |-- 
a: integer (nullable = false) |-- b: string (nullable = true) root |-- key: 
integer (nullable = false) |-- b: string (nullable = true) |-- a: integer 
(nullable = false) org.apache.spark.sql.AnalysisException: Cannot up cast `a` 
from string to int as it may truncate The type path of the target object is: - 
field (class: "scala.Int", name: "a") - root class: "Entity" You can either add 
an expl
{code}


> Union failed between 2 datasets of the same type converted from different 
> dataframes
> 
>
> Key: SPARK-27855
> URL: https://issues.apache.org/jira/browse/SPARK-27855
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Hao Ren
>Priority: Major
>
> Two Datasets of the same type converted from different dataframes cannot be 
> unioned.
> Here is the code to reproduce the problem. It seems `union` just checks the 
> schema of the original dataframe, even if the two datasets have already been 
> converted to the same Dataset type.
> {code:java}
> case class Entity(key: Int, a: Int, b: String)
> val df1 = Seq((2,2,"2")).toDF("key", "a", "b").as[Entity]
> val df2 = Seq((1,"1",1)).toDF("key", "b", "a").as[Entity]
> df1.printSchema
> df2.printSchema
> df1 union df2
> {code}
> Result
> {code:java}
> defined class Entity
> df1: org.apache.spark.sql.Dataset[Entity] = [key: int, a: int ... 1 more 
> field]
> df2: org.apache.spark.sql.Dataset[Entity] = [key: int, b: string ... 1 more 
> field]
> converted
> root
> |-- key: integer (nullable = false)
> |-- a: integer (nullable = false)
> |-- b: string (nullable = true)
> root
> |-- key: integer (nullable = false)
> |-- b: string (nullable = true)
> |-- a: integer (nullable = false)
> org.apache.spark.sql.AnalysisException: Cannot up cast `a` from string to int 
> as it may truncate
> The type path of the target object is:
> - field (class: "scala.Int", name: "a")
> - root class: "Entity"{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27855) Union failed between 2 datasets of the same type converted from different dataframes

2019-05-27 Thread Hao Ren (JIRA)
Hao Ren created SPARK-27855:
---

 Summary: Union failed between 2 datasets of the same type 
converted from different dataframes
 Key: SPARK-27855
 URL: https://issues.apache.org/jira/browse/SPARK-27855
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.3
Reporter: Hao Ren


Two Datasets of the same type converted from different dataframes cannot be unioned.

Here is the code to reproduce the problem. It seems `union` just checks the 
schema of the original dataframe, even if the two datasets have already been 
converted to the same Dataset type.
{code:java}
case class Entity(key: Int, a: Int, b: String)
val df1 = Seq((2,2,"2")).toDF("key", "a", "b").as[Entity]
val df2 = Seq((1,"1",1)).toDF("key", "b", "a").as[Entity]
df1.printSchema
df2.printSchema
df1 union df2
{code}
Result
{code:java}
defined class Entity
df1: org.apache.spark.sql.Dataset[Entity] = [key: int, a: int ... 1 more field]
df2: org.apache.spark.sql.Dataset[Entity] = [key: int, b: string ... 1 more field]
converted
root
|-- key: integer (nullable = false)
|-- a: integer (nullable = false)
|-- b: string (nullable = true)

root
|-- key: integer (nullable = false)
|-- b: string (nullable = true)
|-- a: integer (nullable = false)

org.apache.spark.sql.AnalysisException: Cannot up cast `a` from string to int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "a")
- root class: "Entity"
You can either add an expl
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13182) Spark Executor retries infinitely

2019-05-27 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848904#comment-16848904
 ] 

Sean Owen commented on SPARK-13182:
---

On its face that sounds like a YARN-related issue. Rescheduling a preempted 
task is correct, but not if there are not enough resources available to execute 
it. If resources became available but not for long enough to finish it, that's 
still an app-level and YARN policy issue.

> Spark Executor retries infinitely
> -
>
> Key: SPARK-13182
> URL: https://issues.apache.org/jira/browse/SPARK-13182
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Prabhu Joseph
>Priority: Minor
>
>   When a Spark job (Spark-1.5.2) is submitted with a single executor and if 
> user passes some wrong JVM arguments with spark.executor.extraJavaOptions, 
> the first executor fails. But the job keeps on retrying, creating a new 
> executor and failing every time, until CTRL-C is pressed. 
> ./spark-submit --class SimpleApp --master "spark://10.10.72.145:7077"  --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=16" 
> /SPARK/SimpleApp.jar
> Here when user submits job with ConcGCThreads 16 which is greater than 
> ParallelGCThreads, JVM will crash. But the job does not exit, keeps on 
> creating executors and retrying.
> ..
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID 
> app-20160201065319-0014/2846 on hostPort 10.10.72.145:36558 with 12 cores, 
> 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2846 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2846 is now RUNNING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2846 is now EXITED (Command exited with code 1)
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor 
> app-20160201065319-0014/2846 removed: Command exited with code 1
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove 
> non-existent executor 2846
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: 
> app-20160201065319-0014/2847 on worker-20160131230345-10.10.72.145-36558 
> (10.10.72.145:36558) with 12 cores
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID 
> app-20160201065319-0014/2847 on hostPort 10.10.72.145:36558 with 12 cores, 
> 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2847 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2847 is now EXITED (Command exited with code 1)
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor 
> app-20160201065319-0014/2847 removed: Command exited with code 1
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove 
> non-existent executor 2847
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: 
> app-20160201065319-0014/2848 on worker-20160131230345-10.10.72.145-36558 
> (10.10.72.145:36558) with 12 cores
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID 
> app-20160201065319-0014/2848 on hostPort 10.10.72.145:36558 with 12 cores, 
> 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2848 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2848 is now RUNNING
> Spark should not fall into a trap on this kind of user error on a 
> production cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27803) fix column pruning for python UDF

2019-05-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27803.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24675
[https://github.com/apache/spark/pull/24675]

> fix column pruning for python UDF
> -
>
> Key: SPARK-27803
> URL: https://issues.apache.org/jira/browse/SPARK-27803
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15348) Hive ACID

2019-05-27 Thread Xianyin Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848881#comment-16848881
 ] 

Xianyin Xin commented on SPARK-15348:
-

The starting point (or goal) of Delta Lake is not ACID but the "data lake", and 
ACID is just one of its features. The ACID designs of Hive and Delta are very 
different; both have pros and cons. However, Hive tables and Delta tables are two 
data sources from Spark's perspective, so pluggable ACID support for different 
data sources within one framework is an option. Maybe the DataSource V2 API can 
handle this.

> Hive ACID
> -
>
> Key: SPARK-15348
> URL: https://issues.apache.org/jira/browse/SPARK-15348
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0
>Reporter: Ran Haim
>Priority: Major
>
> Spark does not support any feature of Hive's transactional tables:
> you cannot use Spark to delete/update a table, and it also has problems 
> reading the aggregated data when no compaction was done.
> Also it seems that compaction is not supported - alter table ... partition 
>  COMPACT 'major'



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27833) java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark

2019-05-27 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848840#comment-16848840
 ] 

Gabor Somogyi commented on SPARK-27833:
---

Maybe you could just copy the working sink under a different name and change 
things step by step until it breaks. At first glance this doesn't look like a 
Spark problem.

>  java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark 
> 
>
> Key: SPARK-27833
> URL: https://issues.apache.org/jira/browse/SPARK-27833
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
> Environment: spark 2.3.0
> java 1.8
> kafka version 0.10.
>Reporter: Raviteja
>Priority: Minor
>  Labels: spark-streaming-kafka
> Attachments: kafka_consumer_code.java, kafka_custom_sink.java, 
> kafka_error_log.txt
>
>
> Hi,
> We have a requirement to read data from Kafka, apply some transformations and 
> store the data in a database. For this we are implementing the watermarking 
> feature along with an aggregate function, and for storing we are writing our 
> own sink (Structured Streaming). We are using Spark 2.3.0, Java 1.8 and Kafka 
> version 0.10.
> We are getting the error below:
> "*java.lang.AssertionError: assertion failed: No plan for EventTimeWatermark 
> timestamp#39: timestamp, interval 2 minutes*"
>  
> It works perfectly fine when we use the console sink instead of the custom 
> sink. For debugging the issue, we are performing "dataframe.show()" in our 
> custom sink and nothing else.
> Please find the error log and the code in the attachments. Please look into 
> this issue, as it is a blocker and we are not able to proceed further or find 
> any alternatives, since we need the watermarking feature. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27553) Operation log is not closed when close session

2019-05-27 Thread jinwensc (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848823#comment-16848823
 ] 

jinwensc commented on SPARK-27553:
--

It was closed in org.apache.hive.service.cli.operation.Operation#afterRun,
but the operation log cannot be retrieved from the Thrift server.

> Operation log is not closed when close session
> --
>
> Key: SPARK-27553
> URL: https://issues.apache.org/jira/browse/SPARK-27553
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: pin_zhang
>Priority: Major
>
> On Windows:
> 1. start spark-shell
> 2. start hive server in shell by 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.startWithContext(spark.sqlContext)
> 3. beeline connect to hive server
>     3.1 connect 
>           beeline -u jdbc:hive2://localhost:1
>     3.2 Run SQL
>           show tables;
>     3.3 quit beeline
>           !quit
> Get exception log
> {code}
>  19/04/24 11:38:22 ERROR HiveSessionImpl: Failed to cleanup ses
> sion log dir: SessionHandle [5827428b-d140-4fc0-8ad4-721c39b3ead0]
> java.io.IOException: Unable to delete file: 
> C:\Users\test\AppData\Local\Temp\test\operation_logs\5827428b-d140-4fc0-8ad4-721c39b3ead0\df9cd631-66e7-4303-9a4
> 1-a09bdefcf888
>  at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2279)
>  at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653)
>  at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535)
>  at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2270)
>  at 
> org.apache.hive.service.cli.session.HiveSessionImpl.cleanupSessionLogDir(HiveSessionImpl.java:671)
>  at 
> org.apache.hive.service.cli.session.HiveSessionImpl.close(HiveSessionImpl.java:643)
>  at 
> org.apache.hive.service.cli.session.HiveSessionImplwithUGI.close(HiveSessionImplwithUGI.java:109)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
>  at 
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
>  at 
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>  at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
>  at com.sun.proxy.$Proxy19.close(Unknown Source)
>  at 
> org.apache.hive.service.cli.session.SessionManager.closeSession(SessionManager.java:280)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLSessionManager.closeSession(SparkSQLSessionManager.scala:76)
>  at org.apache.hive.service.cli.CLIService.closeSession(CLIService.java:237)
>  at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.CloseSession(ThriftCLIService.java:397)
>  at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$CloseSession.getResult(TCLIService.java:1273)
>  at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$CloseSession.getResult(TCLIService.java:1258)
>  at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>  at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>  at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
>  at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27826) saveAsTable() function case table have "HiveFileFormat" "ParquetFileFormat" format issue

2019-05-27 Thread fengtlyer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848811#comment-16848811
 ] 

fengtlyer commented on SPARK-27826:
---

Hi Hyukjin,

Our team thinks this is a compatibility issue.

We fully understand that this line of code would work if we used format("hive"); 
however, all of our writes should work with the "parquet" format.

Why is the table not treated as Parquet format when we use an Impala-created 
parquet table?

If we create the table in Hue with an Impala SQL query using "stored as parquet", 
the files in HDFS end with ".parq", yet we can't use 
"write.format("parquet").mode("append").saveAsTable()" to append to this table.

We think there is a compatibility issue here.
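
For reference, a minimal sketch of the two options mentioned in this thread (the table and file names are taken from the error log below; this is only a workaround sketch, not a statement about what the behaviour should be):

{code:java}
val df = spark.read.format("csv")
  .option("sep", ",").option("header", "true")
  .load("hdfs:///temp1/test_paquettest.csv")

// Option 1: write through the Hive serde that the existing table declares.
df.write.format("hive").mode("append").saveAsTable("parquet_test_table")

// Option 2: insert by position into the existing table definition.
df.write.mode("append").insertInto("parquet_test_table")
{code}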

> saveAsTable() function case table have "HiveFileFormat" "ParquetFileFormat" 
> format issue
> 
>
> Key: SPARK-27826
> URL: https://issues.apache.org/jira/browse/SPARK-27826
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 2.4.0
> Environment: CDH 5.13.1 - Spark version 2.2.0.cloudera2
> CDH 6.1.1 - Spark version 2.4.0-cdh6.1.1
>Reporter: fengtlyer
>Priority: Minor
>
> Hi Spark Dev Team,
> We tested a few times and found that this bug can be reproduced in multiple 
> Spark versions.
> We tested in CDH 5.13.1 - Spark version 2.2.0.cloudera2 and CDH 6.1.1 - Spark 
> version 2.4.0-cdh6.1.1.
> Both of them have this bug:
> 1. If a table is created by Impala or Hive in Hue, then in Spark code, 
> "write.format("parquet").mode("append").saveAsTable()" will cause the format 
> issue (see the error log below).
> 2. For a table created by Hive/Impala in Hue, 
> "write.format("parquet").mode("overwrite").saveAsTable()" still does not work.
>  2.1 For a table created by Hive/Impala in Hue, after 
> "write.format("parquet").mode("overwrite").saveAsTable()", 
> "write.format("parquet").mode("append").saveAsTable()" can work.
> 3. For a table created by Hive/Impala in Hue, "insertInto()" still works.
>  3.1 For a table created by Hive/Impala in Hue, after inserting some new 
> records with "insertInto()", trying 
> "write.format("parquet").mode("append").saveAsTable()" gives the same format 
> error log.
> 4. After creating a parquet table and inserting some data via the Hive shell, 
> "write.format("parquet").mode("append").saveAsTable()" can insert data, but 
> Spark only shows the data inserted by Spark, and Hive only shows the data 
> inserted by Hive.
> === 
> Error Log 
> ===
> {code}
> spark.read.format("csv").option("sep",",").option("header","true").load("hdfs:///temp1/test_paquettest.csv").write.format("parquet").mode("append").saveAsTable("parquet_test_table")
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: The format of the existing table 
> default.parquet_test_table is `HiveFileFormat`. It doesn't match the 
> specified format `ParquetFileFormat`.;
> at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:115)
> at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:75)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
> at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:75)
> at 
> org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:71)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
> at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
> at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
> at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
> at 
> 

[jira] [Commented] (SPARK-27808) Ability to ignore existing files for structured streaming

2019-05-27 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848769#comment-16848769
 ] 

Gabor Somogyi commented on SPARK-27808:
---

What I've suggested is of course a workaround and not efficient for huge data 
sets. I've had a look and it seems doable.

> Ability to ignore existing files for structured streaming
> -
>
> Key: SPARK-27808
> URL: https://issues.apache.org/jira/browse/SPARK-27808
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Vladimir Matveev
>Priority: Major
>
> Currently it is not easily possible to make a structured streaming query to 
> ignore all of the existing data inside a directory and only process new 
> files, created after the job was started. See here for example: 
> [https://stackoverflow.com/questions/51391722/spark-structured-streaming-file-source-starting-offset]
>  
> My use case is to ignore everything which existed in the directory when the 
> streaming job is first started (and there are no checkpoints), but to behave 
> as usual when the stream is restarted, e.g. catch up reading new files since 
> the last restart. This would allow us to use the streaming job for continuous 
> processing, with all the benefits it brings, but also to keep the possibility 
> to reprocess the data in the batch fashion by a different job, drop the 
> checkpoints and make the streaming job only run for the new data.
>  
> It would be great to have an option similar to the `newFilesOnly` option on 
> the original StreamingContext.fileStream method: 
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext@fileStream[K,V,F%3C:org.apache.hadoop.mapreduce.InputFormat[K,V]](directory:String,filter:org.apache.hadoop.fs.Path=%3EBoolean,newFilesOnly:Boolean)(implicitevidence$7:scala.reflect.ClassTag[K],implicitevidence$8:scala.reflect.ClassTag[V],implicitevidence$9:scala.reflect.ClassTag[F]):org.apache.spark.streaming.dstream.InputDStream[(K,V])]
> but probably with slightly different semantics, as described above (ignore all 
> existing files for the first run, catch up on the following runs).
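
For illustration, a hypothetical sketch of how such an option could surface on the file source (the option name is made up and does not exist today; the schema and path are placeholders):

{code:java}
// "ignoreExistingFiles" is a hypothetical option mirroring fileStream's
// newFilesOnly flag; inputSchema and the path are placeholders.
val stream = spark.readStream
  .format("parquet")
  .schema(inputSchema)
  .option("ignoreExistingFiles", "true")
  .load("/data/incoming")
{code}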



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27854) [Spark-SQL] OOM when using unequal join sql

2019-05-27 Thread kai zhao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kai zhao updated SPARK-27854:
-
Environment: 
Spark Version:1.6.2

HDP Version:2.5

JDK Version:1.8

OS Version:Redhat 7.3

 

Cluster Info:

8 nodes

Each node :

RAM: 256G

CPU:  Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz  (40 cores)

Disk:10*4T HDD+1T SSD

 

Yarn Config:

NodeManager Memory:210G

NodeManager Vcores:70

 

Runtime Information

Java Home=/opt/jdk1.8.0_131/jre
 Java Version=1.8.0_131 (Oracle Corporation)
 Scala Version=version 2.10.5

Spark Properties

spark.app.id=application_1558686555626_0024
 spark.app.name=org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
 spark.driver.appUIAddress=[http://172.17.3.2:4040|http://172.17.3.2:4040/]
 spark.driver.extraClassPath=/yinhai_platform/resources/spark_dep_jar/*
 
spark.driver.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
 spark.driver.host=172.17.3.2
 spark.driver.maxResultSize=16g
 spark.driver.port=44591
 spark.dynamicAllocation.enabled=true
 spark.dynamicAllocation.initialExecutors=0
 spark.dynamicAllocation.maxExecutors=200
 spark.dynamicAllocation.minExecutors=0
 spark.eventLog.dir=hdfs:///spark-history
 spark.eventLog.enabled=true
 spark.executor.cores=5
 spark.executor.extraClassPath=/yinhai_platform/resources/spark_dep_jar/*
 spark.executor.extraJavaOptions=-XX:MaxPermSize=10240m
 
spark.executor.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
 spark.executor.id=driver
 spark.executor.memory=16g
 spark.externalBlockStore.folderName=spark-058bff7c-f76c-4a0e-86a3-b390f2f06d1a
 spark.hadoop.cacheConf=false
 spark.history.fs.logDirectory=hdfs:///spark-history
 spark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider
 spark.kryo.referenceTracking=false
 spark.kryoserializer.buffer.max=1024m
 spark.local.dir=/data/disk1/spark-local-dir
 spark.master=yarn-client
 spark.network.timeout=600s
 
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS=ambari-node-2
 
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES=[http://ambari-node-2:8088/proxy/application_1558686555626_0024]
 spark.scheduler.allocation.file 
/usr/hdp/current/spark-thriftserver/conf/spark-thrift-fairscheduler.xml
 spark.scheduler.mode=FAIR
 spark.serializer=org.apache.spark.serializer.KryoSerializer
 spark.shuffle.managr=SORT
 spark.shuffle.service.enabled=true
 spark.shuffle.service.port=9339
 spark.submit.deployMode=client
 spark.ui.filters=org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
 spark.yarn.am.cores=5
 spark.yarn.am.memory=16g
 spark.yarn.queue=default

  was:
Spark Version:1.6.2

HDP Version:2.5

JDK Version:1.8

OS Version:Redhat 7.3

 

Runtime Information

Java Home=/opt/jdk1.8.0_131/jre
Java Version=1.8.0_131 (Oracle Corporation)
Scala Version=version 2.10.5


Spark Properties

spark.app.id=application_1558686555626_0024
spark.app.name=org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
spark.driver.appUIAddress=http://172.17.3.2:4040
spark.driver.extraClassPath=/yinhai_platform/resources/spark_dep_jar/*
spark.driver.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.driver.host=172.17.3.2
spark.driver.maxResultSize=16g
spark.driver.port=44591
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.initialExecutors=0
spark.dynamicAllocation.maxExecutors=200
spark.dynamicAllocation.minExecutors=0
spark.eventLog.dir=hdfs:///spark-history
spark.eventLog.enabled=true
spark.executor.cores=5
spark.executor.extraClassPath=/yinhai_platform/resources/spark_dep_jar/*
spark.executor.extraJavaOptions=-XX:MaxPermSize=10240m
spark.executor.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.executor.id=driver
spark.executor.memory=16g
spark.externalBlockStore.folderName=spark-058bff7c-f76c-4a0e-86a3-b390f2f06d1a
spark.hadoop.cacheConf=false
spark.history.fs.logDirectory=hdfs:///spark-history
spark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider
spark.kryo.referenceTracking=false
spark.kryoserializer.buffer.max=1024m
spark.local.dir=/data/disk1/spark-local-dir
spark.master=yarn-client
spark.network.timeout=600s
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS=ambari-node-2
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES=http://ambari-node-2:8088/proxy/application_1558686555626_0024
spark.scheduler.allocation.file 
/usr/hdp/current/spark-thriftserver/conf/spark-thrift-fairscheduler.xml
spark.scheduler.mode=FAIR
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.shuffle.managr=SORT
spark.shuffle.service.enabled=true

[jira] [Updated] (SPARK-27854) [Spark-SQL] OOM when using unequal join sql

2019-05-27 Thread kai zhao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kai zhao updated SPARK-27854:
-
Environment: 
Spark Version:1.6.2

HDP Version:2.5

JDK Version:1.8

OS Version:Redhat 7.3

 

Runtime Information

Java Home=/opt/jdk1.8.0_131/jre
Java Version=1.8.0_131 (Oracle Corporation)
Scala Version=version 2.10.5


Spark Properties

spark.app.id=application_1558686555626_0024
spark.app.name=org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
spark.driver.appUIAddress=http://172.17.3.2:4040
spark.driver.extraClassPath=/yinhai_platform/resources/spark_dep_jar/*
spark.driver.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.driver.host=172.17.3.2
spark.driver.maxResultSize=16g
spark.driver.port=44591
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.initialExecutors=0
spark.dynamicAllocation.maxExecutors=200
spark.dynamicAllocation.minExecutors=0
spark.eventLog.dir=hdfs:///spark-history
spark.eventLog.enabled=true
spark.executor.cores=5
spark.executor.extraClassPath=/yinhai_platform/resources/spark_dep_jar/*
spark.executor.extraJavaOptions=-XX:MaxPermSize=10240m
spark.executor.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.executor.id=driver
spark.executor.memory=16g
spark.externalBlockStore.folderName=spark-058bff7c-f76c-4a0e-86a3-b390f2f06d1a
spark.hadoop.cacheConf=false
spark.history.fs.logDirectory=hdfs:///spark-history
spark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider
spark.kryo.referenceTracking=false
spark.kryoserializer.buffer.max=1024m
spark.local.dir=/data/disk1/spark-local-dir
spark.master=yarn-client
spark.network.timeout=600s
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS=ambari-node-2
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES=http://ambari-node-2:8088/proxy/application_1558686555626_0024
spark.scheduler.allocation.file 
/usr/hdp/current/spark-thriftserver/conf/spark-thrift-fairscheduler.xml
spark.scheduler.mode=FAIR
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.shuffle.managr=SORT
spark.shuffle.service.enabled=true
spark.shuffle.service.port=9339
spark.submit.deployMode=client
spark.ui.filters=org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
spark.yarn.am.cores=5
spark.yarn.am.memory=16g
spark.yarn.queue=default

> [Spark-SQL] OOM when using unequal join sql 
> 
>
> Key: SPARK-27854
> URL: https://issues.apache.org/jira/browse/SPARK-27854
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
> Environment: Spark Version:1.6.2
> HDP Version:2.5
> JDK Version:1.8
> OS Version:Redhat 7.3
>  
> Runtime Information
> Java Home=/opt/jdk1.8.0_131/jre
> Java Version=1.8.0_131 (Oracle Corporation)
> Scala Version=version 2.10.5
> Spark Properties
> spark.app.id=application_1558686555626_0024
> spark.app.name=org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
> spark.driver.appUIAddress=http://172.17.3.2:4040
> spark.driver.extraClassPath=/yinhai_platform/resources/spark_dep_jar/*
> spark.driver.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.driver.host=172.17.3.2
> spark.driver.maxResultSize=16g
> spark.driver.port=44591
> spark.dynamicAllocation.enabled=true
> spark.dynamicAllocation.initialExecutors=0
> spark.dynamicAllocation.maxExecutors=200
> spark.dynamicAllocation.minExecutors=0
> spark.eventLog.dir=hdfs:///spark-history
> spark.eventLog.enabled=true
> spark.executor.cores=5
> spark.executor.extraClassPath=/yinhai_platform/resources/spark_dep_jar/*
> spark.executor.extraJavaOptions=-XX:MaxPermSize=10240m
> spark.executor.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
> spark.executor.id=driver
> spark.executor.memory=16g
> spark.externalBlockStore.folderName=spark-058bff7c-f76c-4a0e-86a3-b390f2f06d1a
> spark.hadoop.cacheConf=false
> spark.history.fs.logDirectory=hdfs:///spark-history
> spark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider
> spark.kryo.referenceTracking=false
> spark.kryoserializer.buffer.max=1024m
> spark.local.dir=/data/disk1/spark-local-dir
> spark.master=yarn-client
> spark.network.timeout=600s
> spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS=ambari-node-2
> spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES=http://ambari-node-2:8088/proxy/application_1558686555626_0024
> spark.scheduler.allocation.file 
> /usr/hdp/current/spark-thriftserver/conf/spark-thrift-fairscheduler.xml
> spark.scheduler.mode=FAIR
> 

[jira] [Commented] (SPARK-27837) Running rand() in SQL with seed of column results in error (rand(col1))

2019-05-27 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848752#comment-16848752
 ] 

Liang-Chi Hsieh commented on SPARK-27837:
-

I don't see how this makes sense. I checked a few DBs and didn't see a rand 
function that works that way.



> Running rand() in SQL with seed of column results in error (rand(col1))
> ---
>
> Key: SPARK-27837
> URL: https://issues.apache.org/jira/browse/SPARK-27837
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Jason Ferrell
>Priority: Major
>
> Running this sql:
> with a as
> (
>  select 123 val1
>  union all
>  select 123 val1
>  union all
>  select 123 val1
> )
> select val1,rand(123),rand(val1)
> from a
> Results in error:  org.apache.spark.sql.AnalysisException: Input argument to 
> rand must be an integer, long or null literal.;
> It doesn't appear to recognize the value of the column as an int.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13182) Spark Executor retries infinitely

2019-05-27 Thread Atul Anand (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848749#comment-16848749
 ] 

Atul Anand edited comment on SPARK-13182 at 5/27/19 9:16 AM:
-

[~srowen] The issue here is spark does not consider this as failures, and so 
keeps retrying.

I have hit infinite retry in a valid scenario, please see 
[here|https://stackoverflow.com/questions/56236216/spark-keeps-relaunching-executors-after-yarn-kills-them]

Basically yarn preempted spark containers as they were running on lower 
priority queue.

But spark restarted the containers right away. Yarn again killed them.

Spark should have hit max failures count after few kills, but it does not 
consider these as failures.
{noformat}
2019-05-20 03:40:07 [dispatcher-event-loop-0] INFO TaskSetManager :54 Task 95 
failed because while it was being computed, its executor exited for a reason 
unrelated to the task. Not counting this failure towards the maximum number of 
failures for the task.{noformat}
Hence it keeps relaunching containers, while Yarn keeps killing them.


was (Author: zxcvmnb):
[~srowen] The issue here is spark does not consider this as failures, and so 
keeps retrying.

I have hit infinite retry in a valid scenario, please see 
[here|https://stackoverflow.com/questions/56236216/spark-keeps-relaunching-executors-after-yarn-kills-them].

Basically yarn preempted spark containers as they were running on lower 
priority queue.

But spark restarted the containers right away. Yarn again killed them.

Spark should have hit max failures count after few kills, but it does not 
consider these as failures.
{noformat}
2019-05-20 03:40:07 [dispatcher-event-loop-0] INFO TaskSetManager :54 Task 95 
failed because while it was being computed, its executor exited for a reason 
unrelated to the task. Not counting this failure towards the maximum number of 
failures for the task.{noformat}
Hence it keeps relaunching containers, while Yarn keeps killing them.

> Spark Executor retries infinitely
> -
>
> Key: SPARK-13182
> URL: https://issues.apache.org/jira/browse/SPARK-13182
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Prabhu Joseph
>Priority: Minor
>
>   When a Spark job (Spark-1.5.2) is submitted with a single executor and if 
> user passes some wrong JVM arguments with spark.executor.extraJavaOptions, 
> the first executor fails. But the job keeps on retrying, creating a new 
> executor and failing every time, until CTRL-C is pressed. 
> ./spark-submit --class SimpleApp --master "spark://10.10.72.145:7077"  --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=16" 
> /SPARK/SimpleApp.jar
> Here when user submits job with ConcGCThreads 16 which is greater than 
> ParallelGCThreads, JVM will crash. But the job does not exit, keeps on 
> creating executors and retrying.
> ..
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID 
> app-20160201065319-0014/2846 on hostPort 10.10.72.145:36558 with 12 cores, 
> 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2846 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2846 is now RUNNING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2846 is now EXITED (Command exited with code 1)
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor 
> app-20160201065319-0014/2846 removed: Command exited with code 1
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove 
> non-existent executor 2846
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: 
> app-20160201065319-0014/2847 on worker-20160131230345-10.10.72.145-36558 
> (10.10.72.145:36558) with 12 cores
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID 
> app-20160201065319-0014/2847 on hostPort 10.10.72.145:36558 with 12 cores, 
> 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2847 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2847 is now EXITED (Command exited with code 1)
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor 
> app-20160201065319-0014/2847 removed: Command exited with code 1
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove 
> non-existent executor 2847
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: 
> app-20160201065319-0014/2848 on worker-20160131230345-10.10.72.145-36558 
> (10.10.72.145:36558) with 12 cores
> 16/02/01 

[jira] [Updated] (SPARK-27854) [Spark-SQL] OOM when using unequal join sql

2019-05-27 Thread kai zhao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kai zhao updated SPARK-27854:
-
Description: 
I am using Spark 1.6.2 from the HDP package.

Reproduce Steps:

1. Start the Spark Thrift Server or write DataFrame code.

2. Execute SQL like: select * from table_a left join table_b on table_a.fieldA 
<= table_b.fieldB

3. Wait until the job is finished.

 

Actual Result:

The SQL fails to execute, with multiple task errors:

a) ExecutorLostFailure (executor 119 exited caused by one of the running tasks) 
Reason: Container marked as failed: 

b) java.lang.OutOfMemoryError: Java heap space 

Expect:

The SQL runs successfully.

I have tried every method I could find on the Internet, but it still does not work.

  was:
I am using Spark 1.6.2 which is from hdp package.

Reproduce Steps:

1.Start Spark thrift server Or write DataFrame Code

2.Execute SQL like:select * from table_a  left join table_b on table_a.fieldA 
<= table_b.fieldB

3.Wait until the job is finished

 

Actual Result:

SQL won't  execute failed With multipule task error:

 

Expect:

SQL runs Successfully 

I have tried every method on the Internet . But it still won't work


> [Spark-SQL] OOM when using unequal join sql 
> 
>
> Key: SPARK-27854
> URL: https://issues.apache.org/jira/browse/SPARK-27854
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: kai zhao
>Priority: Major
>
> I am using Spark 1.6.2 from the HDP package.
> Reproduce Steps:
> 1. Start the Spark Thrift Server or write DataFrame code.
> 2. Execute SQL like: select * from table_a left join table_b on table_a.fieldA 
> <= table_b.fieldB
> 3. Wait until the job is finished.
>  
> Actual Result:
> The SQL fails to execute, with multiple task errors:
> a) ExecutorLostFailure (executor 119 exited caused by one of the running 
> tasks) Reason: Container marked as failed: 
> b) java.lang.OutOfMemoryError: Java heap space 
> Expect:
> The SQL runs successfully.
> I have tried every method I could find on the Internet, but it still does not work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13182) Spark Executor retries infinitely

2019-05-27 Thread Atul Anand (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848749#comment-16848749
 ] 

Atul Anand edited comment on SPARK-13182 at 5/27/19 9:16 AM:
-

[~srowen] The issue here is that Spark does not count these as failures, and so 
it keeps retrying.

I have hit infinite retry in a valid scenario; please see 
[https://stackoverflow.com/questions/56236216/spark-keeps-relaunching-executors-after-yarn-kills-them].

Basically, YARN preempted the Spark containers because they were running on a 
lower-priority queue.

But Spark restarted the containers right away, and YARN killed them again.

Spark should have hit the max failure count after a few kills, but it does not 
count these as failures.
{noformat}
2019-05-20 03:40:07 [dispatcher-event-loop-0] INFO TaskSetManager :54 Task 95 
failed because while it was being computed, its executor exited for a reason 
unrelated to the task. Not counting this failure towards the maximum number of 
failures for the task.{noformat}
Hence it keeps relaunching containers, while Yarn keeps killing them.


was (Author: zxcvmnb):
[~srowen] The issue here is spark does not consider this as failures, and so 
keeps retrying.

I have hit infinite retry in a valid scenario, please see. 
[here|[https://stackoverflow.com/questions/56236216/spark-keeps-relaunching-executors-after-yarn-kills-them]]

Basically yarn preempted spark containers as they were running on lower 
priority queue.

But spark restarted the containers right away. Yarn again killed them.

Spark should have hit max failures count after few kills, but it does not 
consider these as failures.
{noformat}
2019-05-20 03:40:07 [dispatcher-event-loop-0] INFO TaskSetManager :54 Task 95 
failed because while it was being computed, its executor exited for a reason 
unrelated to the task. Not counting this failure towards the maximum number of 
failures for the task.{noformat}
Hence it keeps relaunching containers, while Yarn keeps killing them.

> Spark Executor retries infinitely
> -
>
> Key: SPARK-13182
> URL: https://issues.apache.org/jira/browse/SPARK-13182
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Prabhu Joseph
>Priority: Minor
>
>   When a Spark job (Spark-1.5.2) is submitted with a single executor and if 
> user passes some wrong JVM arguments with spark.executor.extraJavaOptions, 
> the first executor fails. But the job keeps on retrying, creating a new 
> executor and failing every time, until CTRL-C is pressed. 
> ./spark-submit --class SimpleApp --master "spark://10.10.72.145:7077"  --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=16" 
> /SPARK/SimpleApp.jar
> Here when user submits job with ConcGCThreads 16 which is greater than 
> ParallelGCThreads, JVM will crash. But the job does not exit, keeps on 
> creating executors and retrying.
> ..
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID 
> app-20160201065319-0014/2846 on hostPort 10.10.72.145:36558 with 12 cores, 
> 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2846 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2846 is now RUNNING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2846 is now EXITED (Command exited with code 1)
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor 
> app-20160201065319-0014/2846 removed: Command exited with code 1
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove 
> non-existent executor 2846
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: 
> app-20160201065319-0014/2847 on worker-20160131230345-10.10.72.145-36558 
> (10.10.72.145:36558) with 12 cores
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID 
> app-20160201065319-0014/2847 on hostPort 10.10.72.145:36558 with 12 cores, 
> 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2847 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2847 is now EXITED (Command exited with code 1)
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor 
> app-20160201065319-0014/2847 removed: Command exited with code 1
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove 
> non-existent executor 2847
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: 
> app-20160201065319-0014/2848 on worker-20160131230345-10.10.72.145-36558 
> (10.10.72.145:36558) with 12 cores
> 16/02/01 06:54:28 

[jira] [Commented] (SPARK-13182) Spark Executor retries infinitely

2019-05-27 Thread Atul Anand (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848749#comment-16848749
 ] 

Atul Anand commented on SPARK-13182:


[~srowen] The issue here is that Spark does not count these as failures, and so 
it keeps retrying.

I have hit infinite retry in a valid scenario; please see 
[here|https://stackoverflow.com/questions/56236216/spark-keeps-relaunching-executors-after-yarn-kills-them].

Basically, YARN preempted the Spark containers because they were running on a 
lower-priority queue.

But Spark restarted the containers right away, and YARN killed them again.

Spark should have hit the max failure count after a few kills, but it does not 
count these as failures.
{noformat}
2019-05-20 03:40:07 [dispatcher-event-loop-0] INFO TaskSetManager :54 Task 95 
failed because while it was being computed, its executor exited for a reason 
unrelated to the task. Not counting this failure towards the maximum number of 
failures for the task.{noformat}
Hence it keeps relaunching containers, while Yarn keeps killing them.
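For anyone trying to reproduce or bound this behaviour, a minimal sketch is 
below. The two settings are the standard task- and executor-level failure 
limits; whether preempted containers are counted against them depends on the 
Spark/YARN versions involved, which is exactly the question raised above. The 
job itself is just a placeholder.
{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: submit this to a low-priority YARN queue that gets preempted and
// watch whether the preempted executors count against the limits below.
val spark = SparkSession.builder()
  .appName("preemption-retry-check")
  .config("spark.task.maxFailures", "4")            // per-task retry limit
  .config("spark.yarn.max.executor.failures", "10") // application-level executor failure limit
  .getOrCreate()

// Placeholder long-running job so executors stay busy while preemption happens.
spark.range(0, 10000000000L).selectExpr("sum(id)").show()
{code}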

> Spark Executor retries infinitely
> -
>
> Key: SPARK-13182
> URL: https://issues.apache.org/jira/browse/SPARK-13182
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Prabhu Joseph
>Priority: Minor
>
>   When a Spark job (Spark-1.5.2) is submitted with a single executor and if 
> user passes some wrong JVM arguments with spark.executor.extraJavaOptions, 
> the first executor fails. But the job keeps on retrying, creating a new 
> executor and failing every time, until CTRL-C is pressed. 
> ./spark-submit --class SimpleApp --master "spark://10.10.72.145:7077"  --conf 
> "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
> -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=16" 
> /SPARK/SimpleApp.jar
> Here when user submits job with ConcGCThreads 16 which is greater than 
> ParallelGCThreads, JVM will crash. But the job does not exit, keeps on 
> creating executors and retrying.
> ..
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID 
> app-20160201065319-0014/2846 on hostPort 10.10.72.145:36558 with 12 cores, 
> 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2846 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2846 is now RUNNING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2846 is now EXITED (Command exited with code 1)
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor 
> app-20160201065319-0014/2846 removed: Command exited with code 1
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove 
> non-existent executor 2846
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: 
> app-20160201065319-0014/2847 on worker-20160131230345-10.10.72.145-36558 
> (10.10.72.145:36558) with 12 cores
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID 
> app-20160201065319-0014/2847 on hostPort 10.10.72.145:36558 with 12 cores, 
> 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2847 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2847 is now EXITED (Command exited with code 1)
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor 
> app-20160201065319-0014/2847 removed: Command exited with code 1
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove 
> non-existent executor 2847
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: 
> app-20160201065319-0014/2848 on worker-20160131230345-10.10.72.145-36558 
> (10.10.72.145:36558) with 12 cores
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID 
> app-20160201065319-0014/2848 on hostPort 10.10.72.145:36558 with 12 cores, 
> 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2848 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: 
> app-20160201065319-0014/2848 is now RUNNING
> Spark should not fall into a trap on these kind of user errors on a 
> production cluster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27854) [Spark-SQL] OOM when using unequal join sql

2019-05-27 Thread kai zhao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kai zhao updated SPARK-27854:
-
Description: 
I am using Spark 1.6.2 from the HDP package.

Reproduce Steps:

1. Start the Spark Thrift Server or write DataFrame code.

2. Execute SQL like: select * from table_a left join table_b on table_a.fieldA 
<= table_b.fieldB

3. Wait until the job is finished.

 

Actual Result:

The SQL fails to execute, with multiple task errors:

 

Expect:

The SQL runs successfully.

I have tried every method I could find on the Internet, but it still does not work.

  was:
I am using Spark 1.6.2 which is from hdp package.

Reproduce Steps:

1.Start Spark thrift server Or write DataFrame Code

2.Execute SQL like:select * from table_a  left join table_b on table_a.fieldA 
<= table_b.fieldB

3.Wait until the job is finished

 

Actual Result:

SQL won't  execute failed With Error:

Expect:

SQL runs Successfully 

I have tried every method on the Internet . But it still won't work


> [Spark-SQL] OOM when using unequal join sql 
> 
>
> Key: SPARK-27854
> URL: https://issues.apache.org/jira/browse/SPARK-27854
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: kai zhao
>Priority: Major
>
> I am using Spark 1.6.2 from the HDP package.
> Reproduce Steps:
> 1. Start the Spark Thrift Server or write DataFrame code.
> 2. Execute SQL like: select * from table_a left join table_b on table_a.fieldA 
> <= table_b.fieldB
> 3. Wait until the job is finished.
>  
> Actual Result:
> The SQL fails to execute, with multiple task errors:
>  
> Expect:
> The SQL runs successfully.
> I have tried every method I could find on the Internet, but it still does not work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27854) [Spark-SQL] OOM when using unequal join sql

2019-05-27 Thread kai zhao (JIRA)
kai zhao created SPARK-27854:


 Summary: [Spark-SQL] OOM when using unequal join sql 
 Key: SPARK-27854
 URL: https://issues.apache.org/jira/browse/SPARK-27854
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.2
Reporter: kai zhao


I am using Spark 1.6.2 from the HDP package.

Reproduce Steps:

1. Start the Spark Thrift Server or write DataFrame code.

2. Execute SQL like: select * from table_a left join table_b on table_a.fieldA 
<= table_b.fieldB

3. Wait until the job is finished.

 

Actual Result:

The SQL fails to execute, with an error:

Expect:

The SQL runs successfully.

I have tried every method I could find on the Internet, but it still does not work.
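A minimal sketch of the reported query shape is below (the table and column 
names are the placeholders used in this report). Because the join condition has 
no equality predicate, Spark generally cannot use a hash or sort-merge join and 
falls back to a nested-loop style join, which is the usual source of this kind 
of heap pressure.
{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: placeholder tables/columns taken from the report above.
val spark = SparkSession.builder().appName("unequal-join-sketch").getOrCreate()

val result = spark.sql(
  """SELECT *
    |FROM table_a
    |LEFT JOIN table_b
    |  ON table_a.fieldA <= table_b.fieldB""".stripMargin)

// Writing out (rather than collecting) keeps the driver out of the picture.
result.write.mode("overwrite").parquet("/tmp/unequal_join_out")
{code}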



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27851) Allow for custom BroadcastMode#transform return values

2019-05-27 Thread Marc Arndt (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marc Arndt updated SPARK-27851:
---
Summary: Allow for custom BroadcastMode#transform return values  (was: 
Allow for custom BroadcastMode return values)

> Allow for custom BroadcastMode#transform return values
> --
>
> Key: SPARK-27851
> URL: https://issues.apache.org/jira/browse/SPARK-27851
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 2.4.3
>Reporter: Marc Arndt
>Priority: Major
>
> According to the BroadcastMode API, the BroadcastMode#transform methods are 
> allowed to return a result object of an arbitrary type:
> {code:scala}
> /**
>  * Marker trait to identify the shape in which tuples are broadcasted. Typical examples of this are
>  * identity (tuples remain unchanged) or hashed (tuples are converted into some hash index).
>  */
> trait BroadcastMode {
>   def transform(rows: Array[InternalRow]): Any
>   def transform(rows: Iterator[InternalRow], sizeHint: Option[Long]): Any
>   def canonicalized: BroadcastMode
> }
> {code}
> When looking at the code which later uses the instantiated BroadcastMode 
> objects in BroadcastExchangeExec, it becomes apparent that this is not really 
> the case. The following lines in BroadcastExchangeExec suggest that only 
> objects of type HashedRelation and Array[InternalRow] are allowed as a result 
> for the BroadcastMode#transform methods:
> {code:scala}
> // Construct the relation.
> val relation = mode.transform(input, Some(numRows))
> val dataSize = relation match {
>   case map: HashedRelation =>
>     map.estimatedSize
>   case arr: Array[InternalRow] =>
>     arr.map(_.asInstanceOf[UnsafeRow].getSizeInBytes.toLong).sum
>   case _ =>
>     throw new SparkException("[BUG] BroadcastMode.transform returned unexpected type: " +
>       relation.getClass.getName)
> }
> {code}
> I believe that this is the only occurrence in the code where the result of 
> the BroadcastMode#transform method must be either of type HashedRelation or 
> Array[InternalRow]. Therefore, to allow for broader custom implementations of 
> BroadcastMode, I believe it would be a good idea to compute the data size of 
> the broadcast value in a way that is independent of the BroadcastMode 
> implementation used.
> One way this could be done is by providing an additional 
> BroadcastMode#calculateDataSize method, which needs to be implemented by the 
> BroadcastMode implementations. This method could then return the required 
> number of bytes for a given broadcast value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27853) Allow for custom Partitioning implementations

2019-05-27 Thread Marc Arndt (JIRA)
Marc Arndt created SPARK-27853:
--

 Summary: Allow for custom Partitioning implementations
 Key: SPARK-27853
 URL: https://issues.apache.org/jira/browse/SPARK-27853
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer, SQL
Affects Versions: 2.4.3
Reporter: Marc Arndt


When partitioning a Dataset, Spark uses the physical plan element 
ShuffleExchangeExec together with a Partitioning instance. 

I find myself in a situation where I need to provide my own partitioning 
criterion that decides which partition each InternalRow belongs to. Based on 
the Spark API, I would expect to be able to provide this criterion as a custom 
implementation of the Partitioning interface.

Sadly, after implementing a custom Partitioning you will receive an "Exchange 
not implemented for $newPartitioning" error message, because of the following 
code inside the ShuffleExchangeExec#prepareShuffleDependency method:

{code:scala}
val part: Partitioner = newPartitioning match {
  case RoundRobinPartitioning(numPartitions) => new HashPartitioner(numPartitions)
  case HashPartitioning(_, n) =>
    new Partitioner {
      override def numPartitions: Int = n
      // For HashPartitioning, the partitioning key is already a valid partition ID, as we use
      // `HashPartitioning.partitionIdExpression` to produce partitioning key.
      override def getPartition(key: Any): Int = key.asInstanceOf[Int]
    }
  case RangePartitioning(sortingExpressions, numPartitions) =>
    // Internally, RangePartitioner runs a job on the RDD that samples keys to compute
    // partition bounds. To get accurate samples, we need to copy the mutable keys.
    val rddForSampling = rdd.mapPartitionsInternal { iter =>
      val mutablePair = new MutablePair[InternalRow, Null]()
      iter.map(row => mutablePair.update(row.copy(), null))
    }
    implicit val ordering = new LazilyGeneratedOrdering(sortingExpressions, outputAttributes)
    new RangePartitioner(
      numPartitions,
      rddForSampling,
      ascending = true,
      samplePointsPerPartitionHint = SQLConf.get.rangeExchangeSampleSizePerPartition)
  case SinglePartition =>
    new Partitioner {
      override def numPartitions: Int = 1
      override def getPartition(key: Any): Int = 0
    }
  case _ => sys.error(s"Exchange not implemented for $newPartitioning")
  // TODO: Handle BroadcastPartitioning.
}
def getPartitionKeyExtractor(): InternalRow => Any = newPartitioning match {
  case RoundRobinPartitioning(numPartitions) =>
    // Distributes elements evenly across output partitions, starting from a random partition.
    var position = new Random(TaskContext.get().partitionId()).nextInt(numPartitions)
    (row: InternalRow) => {
      // The HashPartitioner will handle the `mod` by the number of partitions
      position += 1
      position
    }
  case h: HashPartitioning =>
    val projection = UnsafeProjection.create(h.partitionIdExpression :: Nil, outputAttributes)
    row => projection(row).getInt(0)
  case RangePartitioning(_, _) | SinglePartition => identity
  case _ => sys.error(s"Exchange not implemented for $newPartitioning")
}
{code}

The code in the snippet above matches the given Partitioning instance 
"newPartitioning" against a set of hardcoded Partitioning types. When a new 
Partitioning implementation is added, the pattern match cannot find a case for 
it and therefore falls through to the fallback case:

{code:java}
case _ => sys.error(s"Exchange not implemented for $newPartitioning")
{code}

and throw an exception.

To be able to provide custom partitioning behaviour, I would suggest changing 
the implementation in ShuffleExchangeExec so that it can work with an arbitrary 
Partitioning implementation. For the Partitioner creation, I imagine this could 
be done cleanly inside the Partitioning classes via a 
Partitioning#createPartitioner method, e.g. as sketched below.
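A rough sketch of that idea follows, under the assumption that Partitioning (or 
a small companion trait) exposes the two hooks ShuffleExchangeExec currently 
derives from the pattern match; the trait and class names here are illustrative 
only, not existing Spark API.
{code:scala}
import org.apache.spark.Partitioner
import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical hook trait: each partitioning builds its own Partitioner and
// row-key extractor, so ShuffleExchangeExec needs no hardcoded cases.
trait PartitionerProvider {
  def createPartitioner(): Partitioner
  def partitionKeyExtractor(): InternalRow => Any
}

// Example custom criterion: route rows by the parity of the first (long) column.
class ParityPartitioning extends PartitionerProvider {
  override def createPartitioner(): Partitioner = new Partitioner {
    override def numPartitions: Int = 2
    override def getPartition(key: Any): Int = key.asInstanceOf[Int]
  }

  override def partitionKeyExtractor(): InternalRow => Any =
    (row: InternalRow) => (row.getLong(0) % 2).toInt
}
{code}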



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27852) One updateBytesWritten operaton may be missed in DiskBlockObjectWriter.scala

2019-05-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27852:


Assignee: (was: Apache Spark)

> One updateBytesWritten operaton may be missed in DiskBlockObjectWriter.scala
> 
>
> Key: SPARK-27852
> URL: https://issues.apache.org/jira/browse/SPARK-27852
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Shuaiqi Ge
>Priority: Major
>
>  In *_DiskBlockObjectWriter.scala_*, there are 2 overload *_write_* 
> functions, the first of which executes _*updateBytesWritten*_ function while 
> the other doesn't. I think writeMetrics should record all the information 
> about writing operations, some data of which will be displayed in the Spark 
> jobs UI such as the data size of shuffle read and shuffle write.
> {code:java}
> def write(key: Any, value: Any) {
>if (!streamOpen) {
>  open()
>}
>objOut.writeKey(key)
>objOut.writeValue(value)
>recordWritten()
> }
> override def write(kvBytes: Array[Byte], offs: Int, len: Int): Unit = {
>if (!streamOpen) {
>   open()
>}
>bs.write(kvBytes, offs, len)
>// updateBytesWritten()   // the function is missed
> }
> **
> * Notify the writer that a record worth of bytes has been written with 
> OutputStream#write.
> */
> def recordWritten(): Unit = {
>numRecordsWritten += 1
>writeMetrics.incRecordsWritten(1)
>if (numRecordsWritten % 16384 == 0) {
>  updateBytesWritten()
>}
> }
> /**
> * Report the number of bytes written in this writer's shuffle write metrics.
> * Note that this is only valid before the underlying streams are closed.
> */
> private def updateBytesWritten() {
>val pos = channel.position()
>writeMetrics.incBytesWritten(pos - reportedPosition)
>reportedPosition = pos
> }
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27852) One updateBytesWritten operaton may be missed in DiskBlockObjectWriter.scala

2019-05-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27852:


Assignee: Apache Spark

> One updateBytesWritten operaton may be missed in DiskBlockObjectWriter.scala
> 
>
> Key: SPARK-27852
> URL: https://issues.apache.org/jira/browse/SPARK-27852
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Shuaiqi Ge
>Assignee: Apache Spark
>Priority: Major
>
>  In *_DiskBlockObjectWriter.scala_*, there are 2 overload *_write_* 
> functions, the first of which executes _*updateBytesWritten*_ function while 
> the other doesn't. I think writeMetrics should record all the information 
> about writing operations, some data of which will be displayed in the Spark 
> jobs UI such as the data size of shuffle read and shuffle write.
> {code:java}
> def write(key: Any, value: Any) {
>if (!streamOpen) {
>  open()
>}
>objOut.writeKey(key)
>objOut.writeValue(value)
>recordWritten()
> }
> override def write(kvBytes: Array[Byte], offs: Int, len: Int): Unit = {
>if (!streamOpen) {
>   open()
>}
>bs.write(kvBytes, offs, len)
>// updateBytesWritten()   // the function is missed
> }
> **
> * Notify the writer that a record worth of bytes has been written with 
> OutputStream#write.
> */
> def recordWritten(): Unit = {
>numRecordsWritten += 1
>writeMetrics.incRecordsWritten(1)
>if (numRecordsWritten % 16384 == 0) {
>  updateBytesWritten()
>}
> }
> /**
> * Report the number of bytes written in this writer's shuffle write metrics.
> * Note that this is only valid before the underlying streams are closed.
> */
> private def updateBytesWritten() {
>val pos = channel.position()
>writeMetrics.incBytesWritten(pos - reportedPosition)
>reportedPosition = pos
> }
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27852) One updateBytesWritten operaton may be missed in DiskBlockObjectWriter.scala

2019-05-27 Thread Shuaiqi Ge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuaiqi Ge updated SPARK-27852:
---
Description: 
In *_DiskBlockObjectWriter.scala_* there are two overloaded *_write_* 
functions; the first eventually calls the _*updateBytesWritten*_ function (via 
recordWritten), while the other doesn't. I think writeMetrics should record all 
information about write operations, since some of this data is displayed in the 
Spark jobs UI, such as the shuffle read and shuffle write sizes.
{code:java}
def write(key: Any, value: Any) {
   if (!streamOpen) {
 open()
   }

   objOut.writeKey(key)
   objOut.writeValue(value)
   recordWritten()
}

override def write(kvBytes: Array[Byte], offs: Int, len: Int): Unit = {
   if (!streamOpen) {
  open()
   }
   bs.write(kvBytes, offs, len)
   // updateBytesWritten()   // the function is missed
}

/**
* Notify the writer that a record worth of bytes has been written with 
OutputStream#write.
*/
def recordWritten(): Unit = {
   numRecordsWritten += 1
   writeMetrics.incRecordsWritten(1)

   if (numRecordsWritten % 16384 == 0) {
 updateBytesWritten()
   }
}

/**
* Report the number of bytes written in this writer's shuffle write metrics.
* Note that this is only valid before the underlying streams are closed.
*/
private def updateBytesWritten() {
   val pos = channel.position()
   writeMetrics.incBytesWritten(pos - reportedPosition)
   reportedPosition = pos
}
{code}
 

 

  was:
 In *_DiskBlockObjectWriter.scala_*, there are 2 overload *_write_* functions, 
the first of which executes _*updateBytesWritten*_ function while the other 
doesn't. I think writeMetrics should record all the information about writing 
operations, some data of which will be displayed in the Spark jobs UI such as 
the data size of shuffle read and shuffle write.
{code:java}
def write(key: Any, value: Any) {
   if (!streamOpen) {
 open()
   }

   objOut.writeKey(key)
   objOut.writeValue(value)
   recordWritten()
}

override def write(kvBytes: Array[Byte], offs: Int, len: Int): Unit = {
   if (!streamOpen) {
  open()
   }
   bs.write(kvBytes, offs, len)
}

**
* Notify the writer that a record worth of bytes has been written with 
OutputStream#write.
*/
def recordWritten(): Unit = {
   numRecordsWritten += 1
   writeMetrics.incRecordsWritten(1)

   if (numRecordsWritten % 16384 == 0) {
 updateBytesWritten()
   }
}

/**
* Report the number of bytes written in this writer's shuffle write metrics.
* Note that this is only valid before the underlying streams are closed.
*/
private def updateBytesWritten() {
   val pos = channel.position()
   writeMetrics.incBytesWritten(pos - reportedPosition)
   reportedPosition = pos
}
{code}
 

 


> One updateBytesWritten operaton may be missed in DiskBlockObjectWriter.scala
> 
>
> Key: SPARK-27852
> URL: https://issues.apache.org/jira/browse/SPARK-27852
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Shuaiqi Ge
>Priority: Major
>
>  In *_DiskBlockObjectWriter.scala_*, there are 2 overload *_write_* 
> functions, the first of which executes _*updateBytesWritten*_ function while 
> the other doesn't. I think writeMetrics should record all the information 
> about writing operations, some data of which will be displayed in the Spark 
> jobs UI such as the data size of shuffle read and shuffle write.
> {code:java}
> def write(key: Any, value: Any) {
>if (!streamOpen) {
>  open()
>}
>objOut.writeKey(key)
>objOut.writeValue(value)
>recordWritten()
> }
> override def write(kvBytes: Array[Byte], offs: Int, len: Int): Unit = {
>if (!streamOpen) {
>   open()
>}
>bs.write(kvBytes, offs, len)
>// updateBytesWritten()   // the function is missed
> }
> **
> * Notify the writer that a record worth of bytes has been written with 
> OutputStream#write.
> */
> def recordWritten(): Unit = {
>numRecordsWritten += 1
>writeMetrics.incRecordsWritten(1)
>if (numRecordsWritten % 16384 == 0) {
>  updateBytesWritten()
>}
> }
> /**
> * Report the number of bytes written in this writer's shuffle write metrics.
> * Note that this is only valid before the underlying streams are closed.
> */
> private def updateBytesWritten() {
>val pos = channel.position()
>writeMetrics.incBytesWritten(pos - reportedPosition)
>reportedPosition = pos
> }
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27852) One updateBytesWritten operaton may be missed in DiskBlockObjectWriter.scala

2019-05-27 Thread Shuaiqi Ge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuaiqi Ge updated SPARK-27852:
---
Description: 
In *_DiskBlockObjectWriter.scala_* there are two overloaded *_write_* 
functions; the first eventually calls the _*updateBytesWritten*_ function (via 
recordWritten), while the other doesn't. I think writeMetrics should record all 
information about write operations, since some of this data is displayed in the 
Spark jobs UI, such as the shuffle read and shuffle write sizes.
{code:java}
def write(key: Any, value: Any) {
   if (!streamOpen) {
 open()
   }

   objOut.writeKey(key)
   objOut.writeValue(value)
   recordWritten()
}

override def write(kvBytes: Array[Byte], offs: Int, len: Int): Unit = {
   if (!streamOpen) {
  open()
   }
   bs.write(kvBytes, offs, len)
}

/**
* Notify the writer that a record worth of bytes has been written with 
OutputStream#write.
*/
def recordWritten(): Unit = {
   numRecordsWritten += 1
   writeMetrics.incRecordsWritten(1)

   if (numRecordsWritten % 16384 == 0) {
 updateBytesWritten()
   }
}

/**
* Report the number of bytes written in this writer's shuffle write metrics.
* Note that this is only valid before the underlying streams are closed.
*/
private def updateBytesWritten() {
   val pos = channel.position()
   writeMetrics.incBytesWritten(pos - reportedPosition)
   reportedPosition = pos
}
{code}
 

 

  was:
 In *_DiskBlockObjectWriter.scala_*, there are 2 overload *_write_* functions, 
the first of which executes _*updateBytesWritten*_ function while the other 
doesn't. I think writeMetrics should record all the information about writing 
operation, some data of which will displayed in the Spark jobs UI such as the 
data size of shuffle read and shuffle write.
{code:java}
def write(key: Any, value: Any) {
   if (!streamOpen) {
 open()
   }

   objOut.writeKey(key)
   objOut.writeValue(value)
   recordWritten()
}

override def write(kvBytes: Array[Byte], offs: Int, len: Int): Unit = {
   if (!streamOpen) {
  open()
   }
   bs.write(kvBytes, offs, len)
}

**
* Notify the writer that a record worth of bytes has been written with 
OutputStream#write.
*/
def recordWritten(): Unit = {
   numRecordsWritten += 1
   writeMetrics.incRecordsWritten(1)

   if (numRecordsWritten % 16384 == 0) {
 updateBytesWritten()
   }
}

/**
* Report the number of bytes written in this writer's shuffle write metrics.
* Note that this is only valid before the underlying streams are closed.
*/
private def updateBytesWritten() {
   val pos = channel.position()
   writeMetrics.incBytesWritten(pos - reportedPosition)
   reportedPosition = pos
}
{code}
 

 


> One updateBytesWritten operaton may be missed in DiskBlockObjectWriter.scala
> 
>
> Key: SPARK-27852
> URL: https://issues.apache.org/jira/browse/SPARK-27852
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Shuaiqi Ge
>Priority: Major
>
>  In *_DiskBlockObjectWriter.scala_*, there are 2 overload *_write_* 
> functions, the first of which executes _*updateBytesWritten*_ function while 
> the other doesn't. I think writeMetrics should record all the information 
> about writing operations, some data of which will be displayed in the Spark 
> jobs UI such as the data size of shuffle read and shuffle write.
> {code:java}
> def write(key: Any, value: Any) {
>if (!streamOpen) {
>  open()
>}
>objOut.writeKey(key)
>objOut.writeValue(value)
>recordWritten()
> }
> override def write(kvBytes: Array[Byte], offs: Int, len: Int): Unit = {
>if (!streamOpen) {
>   open()
>}
>bs.write(kvBytes, offs, len)
> }
> **
> * Notify the writer that a record worth of bytes has been written with 
> OutputStream#write.
> */
> def recordWritten(): Unit = {
>numRecordsWritten += 1
>writeMetrics.incRecordsWritten(1)
>if (numRecordsWritten % 16384 == 0) {
>  updateBytesWritten()
>}
> }
> /**
> * Report the number of bytes written in this writer's shuffle write metrics.
> * Note that this is only valid before the underlying streams are closed.
> */
> private def updateBytesWritten() {
>val pos = channel.position()
>writeMetrics.incBytesWritten(pos - reportedPosition)
>reportedPosition = pos
> }
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27852) One updateBytesWritten operaton may be missed in DiskBlockObjectWriter.scala

2019-05-27 Thread Shuaiqi Ge (JIRA)
Shuaiqi Ge created SPARK-27852:
--

 Summary: One updateBytesWritten operaton may be missed in 
DiskBlockObjectWriter.scala
 Key: SPARK-27852
 URL: https://issues.apache.org/jira/browse/SPARK-27852
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 2.4.3
Reporter: Shuaiqi Ge


In *_DiskBlockObjectWriter.scala_* there are two overloaded *_write_* 
functions; the first eventually calls the _*updateBytesWritten*_ function (via 
recordWritten), while the other doesn't. I think writeMetrics should record all 
information about write operations, since some of this data is displayed in the 
Spark jobs UI, such as the shuffle read and shuffle write sizes.
{code:java}
def write(key: Any, value: Any) {
   if (!streamOpen) {
 open()
   }

   objOut.writeKey(key)
   objOut.writeValue(value)
   recordWritten()
}

override def write(kvBytes: Array[Byte], offs: Int, len: Int): Unit = {
   if (!streamOpen) {
  open()
   }
   bs.write(kvBytes, offs, len)
}

/**
* Notify the writer that a record worth of bytes has been written with 
OutputStream#write.
*/
def recordWritten(): Unit = {
   numRecordsWritten += 1
   writeMetrics.incRecordsWritten(1)

   if (numRecordsWritten % 16384 == 0) {
 updateBytesWritten()
   }
}

/**
* Report the number of bytes written in this writer's shuffle write metrics.
* Note that this is only valid before the underlying streams are closed.
*/
private def updateBytesWritten() {
   val pos = channel.position()
   writeMetrics.incBytesWritten(pos - reportedPosition)
   reportedPosition = pos
}
{code}
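For clarity, this is a sketch of the change the report appears to suggest; 
whether the metric is genuinely lost here, or is accounted for elsewhere (for 
example when the stream is committed), depends on the code path and Spark 
version, so treat this as an illustration rather than a patch.
{code:java}
// Sketch only, not the actual Spark source: the byte-array overload would also
// refresh the shuffle write metrics after handing the bytes to the stream.
override def write(kvBytes: Array[Byte], offs: Int, len: Int): Unit = {
  if (!streamOpen) {
    open()
  }
  bs.write(kvBytes, offs, len)
  updateBytesWritten()  // proposed: keep writeMetrics in sync for raw byte writes
}
{code}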
 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27851) Allow for custom BroadcastMode return values

2019-05-27 Thread Marc Arndt (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marc Arndt updated SPARK-27851:
---
Description: 
According to the BroadcastMode API, the BroadcastMode#transform methods are 
allowed to return a result object of an arbitrary type:

{code:scala}
/**
 * Marker trait to identify the shape in which tuples are broadcasted. Typical examples of this are
 * identity (tuples remain unchanged) or hashed (tuples are converted into some hash index).
 */
trait BroadcastMode {
  def transform(rows: Array[InternalRow]): Any

  def transform(rows: Iterator[InternalRow], sizeHint: Option[Long]): Any

  def canonicalized: BroadcastMode
}
{code}

When looking at the code which later uses the instantiated BroadcastMode 
objects in BroadcastExchangeExec, it becomes apparent that this is not really 
the case.

The following lines in BroadcastExchangeExec suggest that only objects of type 
HashedRelation and Array[InternalRow] are allowed as a result for the 
BroadcastMode#transform methods:

{code:scala}
// Construct the relation.
val relation = mode.transform(input, Some(numRows))

val dataSize = relation match {
  case map: HashedRelation =>
    map.estimatedSize
  case arr: Array[InternalRow] =>
    arr.map(_.asInstanceOf[UnsafeRow].getSizeInBytes.toLong).sum
  case _ =>
    throw new SparkException("[BUG] BroadcastMode.transform returned unexpected type: " +
      relation.getClass.getName)
}
{code}

I believe that this is the only occurrence in the code where the result of the 
BroadcastMode#transform method must be either of type HashedRelation or 
Array[InternalRow]. Therefore, to allow for broader custom implementations of 
BroadcastMode, I believe it would be a good idea to compute the data size of 
the broadcast value in a way that is independent of the BroadcastMode 
implementation used.

One way this could be done is by providing an additional 
BroadcastMode#calculateDataSize method, which needs to be implemented by the 
BroadcastMode implementations. This method could then return the required 
number of bytes for a given broadcast value.

  was:
According to the BroadcastMode API the BroadcastMode#transform methods are 
allows to return a result object of an arbitrary type:

{code:scala}
/**
 * Marker trait to identify the shape in which tuples are broadcasted. Typical 
examples of this are
 * identity (tuples remain unchanged) or hashed (tuples are converted into some 
hash index).
 */
trait BroadcastMode {
  def transform(rows: Array[InternalRow]): Any

  def transform(rows: Iterator[InternalRow], sizeHint: Option[Long]): Any

  def canonicalized: BroadcastMode
}
{code}

When looking at the code which later uses the instantiated BroadcastMode 
objects in BroadcastExchangeExec it becomes that this is not really the base. 

The following lines in BroadcastExchangeExec suggest that only objects of type 
HashRElation and Array[InternalRow] are allowed as a result for the 
BroadcastMode#transform methods:

{code:scala}
// Construct the relation.
val relation = mode.transform(input, Some(numRows))

val dataSize = relation match {
case map: HashedRelation =>
map.estimatedSize
case arr: Array[InternalRow] =>
arr.map(_.asInstanceOf[UnsafeRow].getSizeInBytes.toLong).sum
case _ =>
throw new SparkException("[BUG] BroadcastMode.transform returned 
unexpected type: " +
relation.getClass.getName)
}
{code}

I believe that this is the only occurrence in the code where the result of the 
BroadcastMode#transform method must be either of type HashedRelation or 
Array[InternalRow]. Therefore to allow for broader custom implementations of 
the BroadcastMode I believe it would be a good idea to solve the calculation of 
the data size of the broadcast value in an independent way of the used 
BroadcastMode implemented.

One way this could be done is by providing an additional 
BroadcastMode#calculateDataSize method, which needs to be implemented by the 
BroadcastMode implementations. This methods could then return the required 
number of bytes for a given broadcast value.


> Allow for custom BroadcastMode return values
> 
>
> Key: SPARK-27851
> URL: https://issues.apache.org/jira/browse/SPARK-27851
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 2.4.3
>Reporter: Marc Arndt
>Priority: Major
>
> According to the BroadcastMode API, the BroadcastMode#transform methods are 
> allowed to return a result object of an arbitrary type:
> {code:scala}
> /**
>  * Marker trait to identify the shape in which tuples are broadcasted. 
> Typical examples of this are
>  * identity (tuples remain unchanged) or hashed (tuples are converted into 
> some hash index).
>  */
> trait BroadcastMode {
>   

[jira] [Created] (SPARK-27851) Allow for custom BroadcastMode return values

2019-05-27 Thread Marc Arndt (JIRA)
Marc Arndt created SPARK-27851:
--

 Summary: Allow for custom BroadcastMode return values
 Key: SPARK-27851
 URL: https://issues.apache.org/jira/browse/SPARK-27851
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer, SQL
Affects Versions: 2.4.3
Reporter: Marc Arndt


According to the BroadcastMode API, the BroadcastMode#transform methods are 
allowed to return a result object of an arbitrary type:

{code:scala}
/**
 * Marker trait to identify the shape in which tuples are broadcasted. Typical examples of this are
 * identity (tuples remain unchanged) or hashed (tuples are converted into some hash index).
 */
trait BroadcastMode {
  def transform(rows: Array[InternalRow]): Any

  def transform(rows: Iterator[InternalRow], sizeHint: Option[Long]): Any

  def canonicalized: BroadcastMode
}
{code}

When looking at the code which later uses the instantiated BroadcastMode 
objects in BroadcastExchangeExec, it becomes apparent that this is not really 
the case.

The following lines in BroadcastExchangeExec suggest that only objects of type 
HashedRelation and Array[InternalRow] are allowed as a result for the 
BroadcastMode#transform methods:

{code:scala}
// Construct the relation.
val relation = mode.transform(input, Some(numRows))

val dataSize = relation match {
  case map: HashedRelation =>
    map.estimatedSize
  case arr: Array[InternalRow] =>
    arr.map(_.asInstanceOf[UnsafeRow].getSizeInBytes.toLong).sum
  case _ =>
    throw new SparkException("[BUG] BroadcastMode.transform returned unexpected type: " +
      relation.getClass.getName)
}
{code}

I believe that this is the only occurrence in the code where the result of the 
BroadcastMode#transform method must be either of type HashedRelation or 
Array[InternalRow]. Therefore, to allow for broader custom implementations of 
BroadcastMode, I believe it would be a good idea to compute the data size of 
the broadcast value in a way that is independent of the BroadcastMode 
implementation used.

One way this could be done is by providing an additional 
BroadcastMode#calculateDataSize method, which needs to be implemented by the 
BroadcastMode implementations. This method could then return the required 
number of bytes for a given broadcast value.
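A rough sketch of what the proposed addition could look like is below; this is 
only an illustration of the suggestion above, not the actual Spark trait.
{code:scala}
import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical version of the trait with the proposed size hook, so that
// BroadcastExchangeExec no longer needs to pattern-match on the concrete type
// returned by transform().
trait BroadcastMode {
  def transform(rows: Array[InternalRow]): Any

  def transform(rows: Iterator[InternalRow], sizeHint: Option[Long]): Any

  def canonicalized: BroadcastMode

  /** Proposed addition: size in bytes of the value produced by transform(). */
  def calculateDataSize(broadcastValue: Any): Long
}
{code}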



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27648) In Spark2.4 Structured Streaming:The executor storage memory increasing over time

2019-05-27 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848702#comment-16848702
 ] 

Gabor Somogyi commented on SPARK-27648:
---

OK. Could you please calculate the 2 state graphs like numRowsTotal, etc...?


> In Spark2.4 Structured Streaming:The executor storage memory increasing over 
> time
> -
>
> Key: SPARK-27648
> URL: https://issues.apache.org/jira/browse/SPARK-27648
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: tommy duan
>Priority: Major
> Attachments: houragg(1).out, houragg_filter.csv, 
> image-2019-05-09-17-51-14-036.png, image-2019-05-10-17-49-42-034.png, 
> image-2019-05-24-10-20-25-723.png, image-2019-05-27-10-10-30-460.png
>
>
> *Spark Program Code Business:*
>  Read the topic on kafka, aggregate the stream data sources, and then output 
> it to another topic line of kafka.
> *Problem Description:*
> *1) Using Spark Structured Streaming in a CDH environment (Spark 2.2)*, memory 
> overflow problems often occur (because too many versions of state are kept in 
> memory; this bug has been fixed in Spark 2.4).
> {code:java}
> /spark-submit \
>  --conf "spark.yarn.executor.memoryOverhead=4096M" \
>  --num-executors 15 \
>  --executor-memory 3G \
>  --executor-cores 2 \
>  --driver-memory 6G \
>  ...
> {code}
> Executor memory exceptions occur when running with these submit resources 
> under Spark 2.2, and the normal running time does not exceed one day.
> The solution is to set the executor memory larger than before. My spark-submit 
> script is as follows:
> {code:java}
> /spark-submit \
>  --conf "spark.yarn.executor.memoryOverhead=4096M" \
>  --num-executors 15 \
>  --executor-memory 46G \
>  --executor-cores 3 \
>  --driver-memory 6G \
>  ...
> {code}
> In this case, the spark program can be guaranteed to run stably for a long 
> time, and the executor storage memory is less than 10M (it has been running 
> stably for more than 20 days).
> *2) From the upgrade information of Spark 2.4, we can see that the problem of 
> large memory consumption of state storage has been solved in Spark 2.4.* 
>  So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
> and found that the use of memory was reduced.
>  But a problem arises, as the running time increases, the storage memory of 
> executor is growing (see Executors - > Storage Memory from the Spark on Yarn 
> Resource Manager UI).
>  This program has been running for 14 days (under SPARK 2.2, running with 
> this submit resource, the normal running time is not more than one day, 
> Executor memory abnormalities will occur).
>  The script submitted by the program under spark2.4 is as follows:
> {code:java}
> /spark-submit \
>  --conf "spark.yarn.executor.memoryOverhead=4096M" \
>  --num-executors 15 \
>  --executor-memory 3G \
>  --executor-cores 2 \
>  --driver-memory 6G 
> {code}
> Under Spark 2.4, I counted the size of executor memory as time went by during 
> the running of the spark program:
> |Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
> |23.5H|41.6MB/1.5GB|1.770212766|
> |108.4H|460.2MB/1.5GB|4.245387454|
> |131.7H|559.1MB/1.5GB|4.245254366|
> |135.4H|575MB/1.5GB|4.246676514|
> |153.6H|641.2MB/1.5GB|4.174479167|
> |219H|888.1MB/1.5GB|4.055251142|
> |263H|1126.4MB/1.5GB|4.282889734|
> |309H|1228.8MB/1.5GB|3.976699029|



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27850) Make SparkPlan#doExecuteBroadcast public

2019-05-27 Thread Marc Arndt (JIRA)
Marc Arndt created SPARK-27850:
--

 Summary: Make SparkPlan#doExecuteBroadcast public
 Key: SPARK-27850
 URL: https://issues.apache.org/jira/browse/SPARK-27850
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer, Spark Core, SQL
Affects Versions: 2.4.3
Reporter: Marc Arndt


Broadcasting of SparkPlan results is handled inside the 
SparkPlan#executeBroadcast method. According to the documentation of SparkPlan, 
custom broadcast functionality is provided by overriding the 
`doExecuteBroadcast` method, as indicated by the comment:

{code:scala}
  /**
   * Returns the result of this query as a broadcast variable by delegating to 
`doExecuteBroadcast`
   * after preparations.
   *
   * Concrete implementations of SparkPlan should override `doExecuteBroadcast`.
   */
  final def executeBroadcast[T](): broadcast.Broadcast[T] = executeQuery {
if (isCanonicalizedPlan) {
  throw new IllegalStateException("A canonicalized plan is not supposed to 
be executed.")
}
doExecuteBroadcast()
  }
{code}

When looking at the definition of SparkPlan#doExecuteBroadcast:

{code:scala}
  /**
   * Produces the result of the query as a broadcast variable.
   *
   * Overridden by concrete implementations of SparkPlan.
   */
  protected[sql] def doExecuteBroadcast[T](): broadcast.Broadcast[T] = {
throw new UnsupportedOperationException(s"$nodeName does not implement 
doExecuteBroadcast")
  }
{code}

it becomes apparent that it is not possible to override the method from 
user-defined SparkPlan implementations, because the method has been defined as 
package protected.

To allow custom SparkPlan implementations to provide their own broadcast 
operations, I ask that SparkPlan#doExecuteBroadcast be made a public method, so 
that all SparkPlan implementations, independent of the package they belong to, 
can override it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27849) Redact treeString of FileTable and DataSourceV2ScanExecBase

2019-05-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27849:


Assignee: Apache Spark

> Redact treeString of FileTable and DataSourceV2ScanExecBase
> ---
>
> Key: SPARK-27849
> URL: https://issues.apache.org/jira/browse/SPARK-27849
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> To follow https://github.com/apache/spark/pull/17397, the output of FileTable 
> and DataSourceV2ScanExecBase can contain sensitive information (like Amazon 
> keys). Such information should not end up in logs or be exposed to 
> non-privileged users.
> Add a redaction facility for these outputs to resolve the issue. A user can 
> enable this by setting a regex in the same spark.redaction.string.regex 
> configuration as V1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27849) Redact treeString of FileTable and DataSourceV2ScanExecBase

2019-05-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27849:


Assignee: (was: Apache Spark)

> Redact treeString of FileTable and DataSourceV2ScanExecBase
> ---
>
> Key: SPARK-27849
> URL: https://issues.apache.org/jira/browse/SPARK-27849
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> To follow https://github.com/apache/spark/pull/17397, the output of FileTable 
> and DataSourceV2ScanExecBase can contain sensitive information (like Amazon 
> keys). Such information should not end up in logs or be exposed to 
> non-privileged users.
> Add a redaction facility for these outputs to resolve the issue. A user can 
> enable this by setting a regex in the same spark.redaction.string.regex 
> configuration as V1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27849) Redact treeString of FileTable and DataSourceV2ScanExecBase

2019-05-27 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-27849:
--

 Summary: Redact treeString of FileTable and 
DataSourceV2ScanExecBase
 Key: SPARK-27849
 URL: https://issues.apache.org/jira/browse/SPARK-27849
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


To follow https://github.com/apache/spark/pull/17397, the output of FileTable 
and DataSourceV2ScanExecBase can contain sensitive information (like Amazon 
keys). Such information should not end up in logs or be exposed to 
non-privileged users.
Add a redaction facility for these outputs to resolve the issue. A user can 
enable this by setting a regex in the same spark.redaction.string.regex 
configuration as V1.
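For reference, a usage sketch is below. The configuration key is the one named 
in the description above, and the regex is only an illustrative assumption, not 
a Spark default.
{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: any substring of the string-rendered plan (e.g. treeString /
// explain output) matching the regex gets masked.
val spark = SparkSession.builder()
  .appName("redaction-example")
  .config("spark.redaction.string.regex", "(?i)secret|password|aws[._-]?key")
  .getOrCreate()

// The path is a placeholder; any sensitive text in the plan output is redacted.
spark.read.parquet("s3a://some-bucket/some-path").explain(true)
{code}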





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27322) DataSourceV2: Select from multiple catalogs

2019-05-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27322:


Assignee: Apache Spark

> DataSourceV2: Select from multiple catalogs
> ---
>
> Key: SPARK-27322
> URL: https://issues.apache.org/jira/browse/SPARK-27322
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Assignee: Apache Spark
>Priority: Major
>
> Support multi-catalog in the following SELECT code paths:
>  * SELECT * FROM catalog.db.tbl
>  * TABLE catalog.db.tbl
>  * JOIN or UNION tables from different catalogs
>  * SparkSession.table("catalog.db.tbl")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27322) DataSourceV2: Select from multiple catalogs

2019-05-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27322:


Assignee: (was: Apache Spark)

> DataSourceV2: Select from multiple catalogs
> ---
>
> Key: SPARK-27322
> URL: https://issues.apache.org/jira/browse/SPARK-27322
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> Support multi-catalog in the following SELECT code paths:
>  * SELECT * FROM catalog.db.tbl
>  * TABLE catalog.db.tbl
>  * JOIN or UNION tables from different catalogs
>  * SparkSession.table("catalog.db.tbl")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13283) Spark doesn't escape column names when creating table on JDBC

2019-05-27 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848667#comment-16848667
 ] 

Hyukjin Kwon commented on SPARK-13283:
--

It's just closed due to EOL affect version set. See 
http://apache-spark-developers-list.1001551.n3.nabble.com/Resolving-all-JIRAs-affecting-EOL-releases-td27238.html

> Spark doesn't escape column names when creating table on JDBC
> -
>
> Key: SPARK-13283
> URL: https://issues.apache.org/jira/browse/SPARK-13283
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Priority: Major
>  Labels: bulk-closed
>
> Hi,
> I have the following problem.
> I have a DF where one of the columns is named 'from'.
> {code}
> root
>  |-- from: decimal(20,0) (nullable = true)
> {code}
> When I'm saving it to MySQL database I'm getting error:
> {code}
> Py4JJavaError: An error occurred while calling o183.jdbc.
> : com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an 
> error in your SQL syntax; check the manual that corresponds to your MySQL 
> server version for the right syntax to use near 'from DECIMAL(20,0) , ' at 
> line 1
> {code}
> I think the problem is that Spark doesn't escape column names with the ` sign 
> when creating the table.
> {code}
> `from`
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13283) Spark doesn't escape column names when creating table on JDBC

2019-05-27 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848660#comment-16848660
 ] 

Maciej Bryński commented on SPARK-13283:


[~hyukjin.kwon]
What is resolution: incomplete ?

> Spark doesn't escape column names when creating table on JDBC
> -
>
> Key: SPARK-13283
> URL: https://issues.apache.org/jira/browse/SPARK-13283
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Priority: Major
>  Labels: bulk-closed
>
> Hi,
> I have the following problem.
> I have a DF where one of the columns is named 'from'.
> {code}
> root
>  |-- from: decimal(20,0) (nullable = true)
> {code}
> When I'm saving it to MySQL database I'm getting error:
> {code}
> Py4JJavaError: An error occurred while calling o183.jdbc.
> : com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an 
> error in your SQL syntax; check the manual that corresponds to your MySQL 
> server version for the right syntax to use near 'from DECIMAL(20,0) , ' at 
> line 1
> {code}
> I think the problem is that Spark doesn't escape column names with the ` sign 
> when creating the table.
> {code}
> `from`
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org