[jira] [Commented] (SPARK-29142) Pyspark clustering models support column setters/getters/predict

2019-09-18 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933081#comment-16933081
 ] 

Huaxin Gao commented on SPARK-29142:


I will work on this. Thanks!

> Pyspark clustering models support column setters/getters/predict
> 
>
> Key: SPARK-29142
> URL: https://issues.apache.org/jira/browse/SPARK-29142
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> Unlike the reg/clf models, the clustering models do not share a common base 
> class, so we need to add these methods one by one.
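
For context, here is a hedged sketch of the column setters/getters as they 
already exist on the Scala side (KMeansModel), which the PySpark clustering 
models in this ticket would mirror; the tiny training DataFrame is made up for 
illustration.

{code}
// Minimal sketch using Spark's Scala ML API; the training data is invented.
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("kmeans-columns").getOrCreate()
val training = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(0.0, 0.0)),
  Tuple1(Vectors.dense(9.0, 9.0))
)).toDF("features")

val model = new KMeans().setK(2).setSeed(1L).fit(training)
model.setPredictionCol("cluster")      // column setter
println(model.getPredictionCol)        // column getter, prints "cluster"
model.transform(training).show()       // output now carries a "cluster" column
{code}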



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark

2019-09-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933077#comment-16933077
 ] 

Dongjoon Hyun commented on SPARK-29106:
---

Ur, [~huangtianhua]. You should update the `Description`. Comments are easily 
hidden. Could you organize the information into the `Description`?

> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Minor
>
> Add arm test jobs to amplab jenkins. OpenLab will offer arm instances to 
> amplab to support arm test for spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28989) Add `spark.sql.ansi.enabled`

2019-09-18 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-28989.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> Add `spark.sql.ansi.enabled`
> 
>
> Key: SPARK-28989
> URL: https://issues.apache.org/jira/browse/SPARK-28989
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, there are new configurations for compatibility with ANSI SQL:
> * spark.sql.parser.ansi.enabled
> * spark.sql.decimalOperations.nullOnOverflow
> * spark.sql.failOnIntegralTypeOverflow
> To make it simple and straightforward, let's merge these configurations into 
> a single one, `spark.sql.ansi.enabled`. When the configuration is true, Spark 
> tries to conform to the ANSI SQL specification. It will be disabled by default.
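
As a hedged illustration only, the merged flag can be toggled per session once 
it exists; the behavioral note in the comments below is an assumption about 
ANSI semantics, not text quoted from this ticket.

{code}
// Minimal sketch: enabling the merged ANSI flag on a SparkSession.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("ansi-mode").getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "true")   // off by default

// Assumption: under ANSI mode, operations such as integral overflow are
// expected to raise errors instead of silently wrapping.
// spark.sql("SELECT 2147483647 + 1").show()
{code}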



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29168) Fix the appearance issue on timeline view

2019-09-18 Thread Tomoko Komiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Komiyama updated SPARK-29168:

Summary: Fix the appearance issue on timeline view  (was: Fix the appearnce 
issue on timeline view)

> Fix the appearance issue on timeline view
> -
>
> Key: SPARK-29168
> URL: https://issues.apache.org/jira/browse/SPARK-29168
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Tomoko Komiyama
>Priority: Minor
> Attachments: after_click.png
>
>
> In WebUI, executor bar's color changes blue to green with no meaning when you 
> click it.
> See the attachment.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29168) Fix the appearnce issue on timeline view

2019-09-18 Thread Tomoko Komiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Komiyama updated SPARK-29168:

Attachment: SPARK-29168.url

> Fix the appearnce issue on timeline view
> 
>
> Key: SPARK-29168
> URL: https://issues.apache.org/jira/browse/SPARK-29168
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Tomoko Komiyama
>Priority: Minor
> Attachments: after_click.png
>
>
> In WebUI, executor bar's color changes blue to green with no meaning when you 
> click it.
> See the attachment.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29168) Fix the appearnce issue on timeline view

2019-09-18 Thread Tomoko Komiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Komiyama updated SPARK-29168:

Attachment: SPARK-29168.url

> Fix the appearnce issue on timeline view
> 
>
> Key: SPARK-29168
> URL: https://issues.apache.org/jira/browse/SPARK-29168
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Tomoko Komiyama
>Priority: Minor
> Attachments: after_click.png
>
>
> In WebUI, executor bar's color changes blue to green with no meaning when you 
> click it.
> See the attachment.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29168) Fix the appearnce issue on timeline view

2019-09-18 Thread Tomoko Komiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Komiyama updated SPARK-29168:

Attachment: (was: SPARK-29168.url)

> Fix the appearnce issue on timeline view
> 
>
> Key: SPARK-29168
> URL: https://issues.apache.org/jira/browse/SPARK-29168
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Tomoko Komiyama
>Priority: Minor
> Attachments: after_click.png
>
>
> In WebUI, executor bar's color changes blue to green with no meaning when you 
> click it.
> See the attachment.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29168) Fix the appearnce issue on timeline view

2019-09-18 Thread Tomoko Komiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Komiyama updated SPARK-29168:

Attachment: (was: SPARK-29168.url)

> Fix the appearnce issue on timeline view
> 
>
> Key: SPARK-29168
> URL: https://issues.apache.org/jira/browse/SPARK-29168
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Tomoko Komiyama
>Priority: Minor
> Attachments: after_click.png
>
>
> In WebUI, executor bar's color changes blue to green with no meaning when you 
> click it.
> See the attachment.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29168) Fix the appearnce issue on timeline view

2019-09-18 Thread Tomoko Komiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Komiyama updated SPARK-29168:

Description: 
In the Web UI, the executor bar's color changes from blue to green for no 
reason when you click it.

See the attachment.

 

  was:
In WebUI, executor bar's color changes blue to green with no meaning when you 
click it.

 


> Fix the appearnce issue on timeline view
> 
>
> Key: SPARK-29168
> URL: https://issues.apache.org/jira/browse/SPARK-29168
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Tomoko Komiyama
>Priority: Minor
> Attachments: after_click.png
>
>
> In WebUI, executor bar's color changes blue to green with no meaning when you 
> click it.
> See the attachment.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29168) Fix the appearnce issue on timeline view

2019-09-18 Thread Tomoko Komiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Komiyama updated SPARK-29168:

Attachment: (was: html_before_c.png)

> Fix the appearnce issue on timeline view
> 
>
> Key: SPARK-29168
> URL: https://issues.apache.org/jira/browse/SPARK-29168
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Tomoko Komiyama
>Priority: Minor
> Attachments: after_click.png
>
>
> In WebUI, executor bar's color changes blue to green with no meaning when you 
> click it.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29168) Fix the appearnce issue on timeline view

2019-09-18 Thread Tomoko Komiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Komiyama updated SPARK-29168:

Attachment: after_click.png

> Fix the appearnce issue on timeline view
> 
>
> Key: SPARK-29168
> URL: https://issues.apache.org/jira/browse/SPARK-29168
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Tomoko Komiyama
>Priority: Minor
> Attachments: after_click.png
>
>
> In WebUI, executor bar's color changes blue to green with no meaning when you 
> click it.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29168) Fix the appearnce issue on timeline view

2019-09-18 Thread Tomoko Komiyama (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Komiyama updated SPARK-29168:

Attachment: html_before_c.png

> Fix the appearnce issue on timeline view
> 
>
> Key: SPARK-29168
> URL: https://issues.apache.org/jira/browse/SPARK-29168
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Tomoko Komiyama
>Priority: Minor
> Attachments: after_click.png
>
>
> In WebUI, executor bar's color changes blue to green with no meaning when you 
> click it.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29168) Fix the appearnce issue on timeline view

2019-09-18 Thread Tomoko Komiyama (Jira)
Tomoko Komiyama created SPARK-29168:
---

 Summary: Fix the appearnce issue on timeline view
 Key: SPARK-29168
 URL: https://issues.apache.org/jira/browse/SPARK-29168
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Tomoko Komiyama


In the Web UI, the executor bar's color changes from blue to green for no 
reason when you click it.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29167) Metrics of Analyzer/Optimizer use Scientific counting is not human readable

2019-09-18 Thread angerszhu (Jira)
angerszhu created SPARK-29167:
-

 Summary: Metrics of Analyzer/Optimizer use Scientific counting is 
not human readable
 Key: SPARK-29167
 URL: https://issues.apache.org/jira/browse/SPARK-29167
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: angerszhu


The metrics of the Analyzer/Optimizer are displayed in scientific notation, 
which is not human readable.

!image-2019-09-19-11-36-18-966.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark

2019-09-18 Thread huangtianhua (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933014#comment-16933014
 ] 

huangtianhua commented on SPARK-29106:
--

The other important thing is about leveldbjni 
[https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80]
 spark depends on leveldbjni-all-1.8 
[https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8],
 and we can see there is no arm64 support. So we built an arm64-supporting 
release of leveldbjni, see 
[https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8],
 but we can't modify the spark pom.xml directly with something like a 
'property'/'profile' to choose the correct jar package on an arm or x86 platform, 
because spark depends on some hadoop packages like hadoop-hdfs, and those 
packages depend on leveldbjni-all-1.8 too, unless hadoop releases with a new 
arm-supporting leveldbjni jar. For now we download the leveldbjni-all-1.8 from 
openlabtesting and 'mvn install' it when running the arm tests for spark.

> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Minor
>
> Add arm test jobs to amplab jenkins. OpenLab will offer arm instances to 
> amplab to support arm test for spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29106) Add jenkins arm test for spark

2019-09-18 Thread huangtianhua (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933001#comment-16933001
 ] 

huangtianhua edited comment on SPARK-29106 at 9/19/19 2:34 AM:
---

[~dongjoon], thanks :)

So far we have set up two periodic arm test jobs for spark in OpenLab: one is 
based on master with hadoop 2.7 (similar to the QA test on amplab jenkins), and 
the other is based on a branch we cut on 09-09, see 
[http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64]
 and 
[http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64,|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64]
 I think we only have to care about the first one when integrating the arm tests 
with amplab jenkins. In fact we have also run the k8s tests on arm, see 
[https://github.com/theopenlab/spark/pull/17], so maybe we can integrate that 
later. We also plan to test other stable branches, and we can integrate them 
into amplab when they are ready.

We have offered an arm instance and sent the details to shane knapp; thanks 
shane for adding the first arm job to amplab jenkins :) 

ps: the issues found and fixed list:
 SPARK-28770
 [https://github.com/apache/spark/pull/25673]
  
 SPARK-28519
 [https://github.com/apache/spark/pull/25279]
  
 SPARK-28433
 [https://github.com/apache/spark/pull/25186]
  
  


was (Author: huangtianhua):
[~dongjoon], thanks :)

Till now we made two arm test periodic jobs for spark in OpenLab, one is based 
on master with hadoop 2.7, other one is based on a new branch which we made on 
date 09-09, see  
[http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64]
  and 
[http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64,|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64]
 I think we only have to care about the first one when integrate arm test with 
amplab jenkins. In fact we have took test for k8s on arm, see 
[https://github.com/theopenlab/spark/pull/17], maybe we can integrate it later. 
 And we plan test on other stable branches too, and we can integrate them to 
amplab when they are ready.

We have offered an arm instance and sent the infos to shane knapp, thanks shane 
to add the first arm job to amplab jenkins :) 

ps: the issues found and fixed list:
 SPARK-28770
[https://github.com/apache/spark/pull/25673]
  
 SPARK-28519
 [https://github.com/apache/spark/pull/25279]
  
 SPARK-28433
 [https://github.com/apache/spark/pull/25186]
  
  

> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Minor
>
> Add arm test jobs to amplab jenkins. OpenLab will offer arm instances to 
> amplab to support arm test for spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29106) Add jenkins arm test for spark

2019-09-18 Thread huangtianhua (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933001#comment-16933001
 ] 

huangtianhua edited comment on SPARK-29106 at 9/19/19 2:31 AM:
---

[~dongjoon], thanks :)

Till now we made two arm test periodic jobs for spark in OpenLab, one is based 
on master with hadoop 2.7, other one is based on a new branch which we made on 
date 09-09, see  
[http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64]
  and 
[http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64,|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64]
 I think we only have to care about the first one when integrate arm test with 
amplab jenkins. In fact we have took test for k8s on arm, see 
[https://github.com/theopenlab/spark/pull/17], maybe we can integrate it later. 
 And we plan test on other stable branches too, and we can integrate them to 
amplab when they are ready.

We have offered an arm instance and sent the infos to shane knapp, thanks shane 
to add the first arm job to amplab jenkins :) 

ps: the issues found and fixed list:
 SPARK-28770
[https://github.com/apache/spark/pull/25673]
  
 SPARK-28519
 [https://github.com/apache/spark/pull/25279]
  
 SPARK-28433
 [https://github.com/apache/spark/pull/25186]
  
  


was (Author: huangtianhua):
[~dongjoon], thanks :)

Till now we made two arm test periodic jobs for spark in OpenLab, one is based 
on master with hadoop 2.7, other one is based on a new branch which we made on 
date 09-09, see  
[http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64]
  and 
[http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64,|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64]
 I think we only have to care about the first one when integrate arm test with 
amplab jenkins. In fact we have took test for k8s on arm, see 
[https://github.com/theopenlab/spark/pull/17], maybe we can integrate it later. 
 And we plan test on other stable branches too, and we can integrate them to 
amplab when they are ready.

We have offered an arm instance and sent the infos to shane knapp, thanks shane 
to add the first arm job to amplab jenkins :) 

ps: the issues found and fixed list:
[{color:#172b4d}{color}|https://github.com/apache/spark/pull/25186]SPARK-28770
[|https://github.com/apache/spark/pull/25186] 
[https://github.com/apache/spark/pull/25673]
 
SPARK-28519
[https://github.com/apache/spark/pull/25279]
 
SPARK-28433
[https://github.com/apache/spark/pull/25186]
 
 

> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Minor
>
> Add arm test jobs to amplab jenkins. OpenLab will offer arm instances to 
> amplab to support arm test for spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark

2019-09-18 Thread huangtianhua (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933001#comment-16933001
 ] 

huangtianhua commented on SPARK-29106:
--

[~dongjoon], thanks :)

Till now we made two arm test periodic jobs for spark in OpenLab, one is based 
on master with hadoop 2.7, other one is based on a new branch which we made on 
date 09-09, see  
[http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64]
  and 
[http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64,|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64]
 I think we only have to care about the first one when integrate arm test with 
amplab jenkins. In fact we have took test for k8s on arm, see 
[https://github.com/theopenlab/spark/pull/17], maybe we can integrate it later. 
 And we plan test on other stable branches too, and we can integrate them to 
amplab when they are ready.

We have offered an arm instance and sent the infos to shane knapp, thanks shane 
to add the first arm job to amplab jenkins :) 

ps: the issues found and fixed list:
[{color:#172b4d}{color}|https://github.com/apache/spark/pull/25186]SPARK-28770
[|https://github.com/apache/spark/pull/25186] 
[https://github.com/apache/spark/pull/25673]
 
SPARK-28519
[https://github.com/apache/spark/pull/25279]
 
SPARK-28433
[https://github.com/apache/spark/pull/25186]
 
 

> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Minor
>
> Add arm test jobs to amplab jenkins. OpenLab will offer arm instances to 
> amplab to support arm test for spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29102) Read gzipped file into multiple partitions without full gzip expansion on a single-node

2019-09-18 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932997#comment-16932997
 ] 

Nicholas Chammas commented on SPARK-29102:
--

{quote}It duplicately decompresses and each map task process what they want. 
And then, each map task stops decompressing if they processes what they want.
{quote}
Yup, that's what I was suggesting in this issue. Glad some folks have already 
tried that out. Hopefully, I'll get lucky and 
{{nl.basjes.hadoop.io.compress.SplittableGzipCodec}} will just work for me.
{quote}We could resolve this JIRA but if you feel like it's still feasible, I 
don't mind leaving this JIRA open.
{quote}
I've resolved it for now as "Won't Fix". I'll report back here if the solution 
you pointed me to works.
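
For anyone trying the same thing, here is a hedged sketch of how the codec 
might be wired into a Spark read. It assumes the nl.basjes splittablegzip 
artifact is on the classpath; the file path is made up, and nothing here is 
verified in this thread.

{code}
// Minimal sketch: register the splittable gzip codec through the Hadoop
// configuration (Spark forwards "spark.hadoop.*" keys to Hadoop).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("splittable-gzip")
  .config("spark.hadoop.io.compression.codecs",
          "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
  .getOrCreate()

// A large gzipped CSV (hypothetical path) should then be readable as several splits.
val df = spark.read.option("header", "true").csv("/data/large-file.csv.gz")
println(df.rdd.getNumPartitions)
{code}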

> Read gzipped file into multiple partitions without full gzip expansion on a 
> single-node
> ---
>
> Key: SPARK-29102
> URL: https://issues.apache.org/jira/browse/SPARK-29102
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Large gzipped files are a common stumbling block for new users (SPARK-5685, 
> SPARK-28366) and an ongoing pain point for users who must process such files 
> delivered from external parties who can't or won't break them up into smaller 
> files or compress them using a splittable compression format like bzip2.
> To deal with large gzipped files today, users must either load them via a 
> single task and then repartition the resulting RDD or DataFrame, or they must 
> launch a preprocessing step outside of Spark to split up the file or 
> recompress it using a splittable format. In either case, the user needs a 
> single host capable of holding the entire decompressed file.
> Spark can potentially a) spare new users the confusion over why only one task 
> is processing their gzipped data, and b) relieve new and experienced users 
> alike from needing to maintain infrastructure capable of decompressing a 
> large gzipped file on a single node, by directly loading gzipped files into 
> multiple partitions across the cluster.
> The rough idea is to have tasks divide a given gzipped file into ranges and 
> then have them all concurrently decompress the file, with each task throwing 
> away the data leading up to the target range. (This kind of partial 
> decompression is apparently [doable using standard Unix 
> utilities|https://unix.stackexchange.com/a/415831/70630], so it should be 
> doable in Spark too.)
> In this way multiple tasks can concurrently load a single gzipped file into 
> multiple partitions. Even though every task will need to unpack the file from 
> the beginning to the task's target range, and the stage will run no faster 
> than what it would take with Spark's current gzip loading behavior, this 
> nonetheless addresses the two problems called out above. Users no longer need 
> to load and then repartition gzipped files, and their infrastructure does not 
> need to decompress any large gzipped file on a single node.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29102) Read gzipped file into multiple partitions without full gzip expansion on a single-node

2019-09-18 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-29102.
--
Resolution: Won't Fix

> Read gzipped file into multiple partitions without full gzip expansion on a 
> single-node
> ---
>
> Key: SPARK-29102
> URL: https://issues.apache.org/jira/browse/SPARK-29102
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Large gzipped files are a common stumbling block for new users (SPARK-5685, 
> SPARK-28366) and an ongoing pain point for users who must process such files 
> delivered from external parties who can't or won't break them up into smaller 
> files or compress them using a splittable compression format like bzip2.
> To deal with large gzipped files today, users must either load them via a 
> single task and then repartition the resulting RDD or DataFrame, or they must 
> launch a preprocessing step outside of Spark to split up the file or 
> recompress it using a splittable format. In either case, the user needs a 
> single host capable of holding the entire decompressed file.
> Spark can potentially a) spare new users the confusion over why only one task 
> is processing their gzipped data, and b) relieve new and experienced users 
> alike from needing to maintain infrastructure capable of decompressing a 
> large gzipped file on a single node, by directly loading gzipped files into 
> multiple partitions across the cluster.
> The rough idea is to have tasks divide a given gzipped file into ranges and 
> then have them all concurrently decompress the file, with each task throwing 
> away the data leading up to the target range. (This kind of partial 
> decompression is apparently [doable using standard Unix 
> utilities|https://unix.stackexchange.com/a/415831/70630], so it should be 
> doable in Spark too.)
> In this way multiple tasks can concurrently load a single gzipped file into 
> multiple partitions. Even though every task will need to unpack the file from 
> the beginning to the task's target range, and the stage will run no faster 
> than what it would take with Spark's current gzip loading behavior, this 
> nonetheless addresses the two problems called out above. Users no longer need 
> to load and then repartition gzipped files, and their infrastructure does not 
> need to decompress any large gzipped file on a single node.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27891) Long running spark jobs fail because of HDFS delegation token expires

2019-09-18 Thread avinash v kodikal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932981#comment-16932981
 ] 

avinash v kodikal commented on SPARK-27891:
---

[~vanzin] - Did you get a chance to look at the latest logs? Please let us know 
if this can be addressed in the ongoing spark release.

> Long running spark jobs fail because of HDFS delegation token expires
> -
>
> Key: SPARK-27891
> URL: https://issues.apache.org/jira/browse/SPARK-27891
> Project: Spark
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 2.0.1, 2.1.0, 2.3.1, 2.4.1
>Reporter: hemshankar sahu
>Priority: Critical
> Attachments: application_1559242207407_0001.log, 
> spark_2.3.1_failure.log
>
>
> When the spark job runs on a secured cluster for longer than the time that is 
> mentioned in the dfs.namenode.delegation.token.renew-interval property of 
> hdfs-site.xml, the spark job fails.
> Following command was used to submit the spark job
> bin/spark-submit --principal acekrbuser --keytab ~/keytabs/acekrbuser.keytab 
> --master yarn --deploy-mode cluster examples/src/main/python/wordcount.py 
> /tmp/ff1.txt
>  
> Application Logs attached
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29102) Read gzipped file into multiple partitions without full gzip expansion on a single-node

2019-09-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932968#comment-16932968
 ] 

Hyukjin Kwon edited comment on SPARK-29102 at 9/19/19 1:27 AM:
---

Yea, that _might_ work. It's been too long since I investigated it, so I don't 
even remember whether I actually tested it or not.
BTW, {{nl.basjes.hadoop.io.compress.SplittableGzipCodec}} actually decompresses 
the same file multiple times, redundantly.

Each map task decompresses from the beginning and processes only the range it 
wants, then stops decompressing once it has processed that range.
So, theoretically, the map task that has to process the last block of the 
gzipped file has to decompress the whole file.

There seems to be a performance advantage nevertheless.

Another way:

{quote}
 it has to make an index after scanning once first
{quote}

To fully allow partial decompression (IIRC, a long time ago I worked on this 
approach), it has to build a separate index.
I tried to scan once first, build a separate index file, and then decompress 
partially, but IIRC the performance was poor (about as poor as 
{{nl.basjes.hadoop.io.compress.SplittableGzipCodec}}, _IIRC_).


was (Author: hyukjin.kwon):
Yea, that _might_ work. It's been too long since I investigated that so I don't 
even remember if I actually tested or not.
BTW, {{nl.basjes.hadoop.io.compress.SplittableGzipCodec}} actually decompresses 
the same files multiple times duplicately.

It duplicatedly decompresses and each map task processes what they want. And 
then, each map task stops decompressing if they processes what they one.
So. theorically the map tasks that has to process the last block of the gzipped 
file has to decompress whole file.

Seems like there's performance advantage nevertheless.

Another way:

{quote}
 it has to make an index after scanning once first
{quote}

To fully allow partial decompress (IIRC .. a long ago, I worked on this way), 
it has to make a separate index.
I tried to: scan once first, make a separate index file and decompress it 
partially but IIRC performance was poor (as much as 
{{nl.basjes.hadoop.io.compress.SplittableGzipCodec}} _IIRC_).

> Read gzipped file into multiple partitions without full gzip expansion on a 
> single-node
> ---
>
> Key: SPARK-29102
> URL: https://issues.apache.org/jira/browse/SPARK-29102
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Large gzipped files are a common stumbling block for new users (SPARK-5685, 
> SPARK-28366) and an ongoing pain point for users who must process such files 
> delivered from external parties who can't or won't break them up into smaller 
> files or compress them using a splittable compression format like bzip2.
> To deal with large gzipped files today, users must either load them via a 
> single task and then repartition the resulting RDD or DataFrame, or they must 
> launch a preprocessing step outside of Spark to split up the file or 
> recompress it using a splittable format. In either case, the user needs a 
> single host capable of holding the entire decompressed file.
> Spark can potentially a) spare new users the confusion over why only one task 
> is processing their gzipped data, and b) relieve new and experienced users 
> alike from needing to maintain infrastructure capable of decompressing a 
> large gzipped file on a single node, by directly loading gzipped files into 
> multiple partitions across the cluster.
> The rough idea is to have tasks divide a given gzipped file into ranges and 
> then have them all concurrently decompress the file, with each task throwing 
> away the data leading up to the target range. (This kind of partial 
> decompression is apparently [doable using standard Unix 
> utilities|https://unix.stackexchange.com/a/415831/70630], so it should be 
> doable in Spark too.)
> In this way multiple tasks can concurrently load a single gzipped file into 
> multiple partitions. Even though every task will need to unpack the file from 
> the beginning to the task's target range, and the stage will run no faster 
> than what it would take with Spark's current gzip loading behavior, this 
> nonetheless addresses the two problems called out above. Users no longer need 
> to load and then repartition gzipped files, and their infrastructure does not 
> need to decompress any large gzipped file on a single node.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29102) Read gzipped file into multiple partitions without full gzip expansion on a single-node

2019-09-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932970#comment-16932970
 ] 

Hyukjin Kwon commented on SPARK-29102:
--

So .. the workaround _might_ be 
{{nl.basjes.hadoop.io.compress.SplittableGzipCodec}}. Otherwise, it is pretty 
difficult to do. IIRC, there are some issues open on the HDFS side that have 
not been resolved for a very long time.

We could resolve this JIRA, but if you feel it's still feasible, I don't mind 
leaving it open.

> Read gzipped file into multiple partitions without full gzip expansion on a 
> single-node
> ---
>
> Key: SPARK-29102
> URL: https://issues.apache.org/jira/browse/SPARK-29102
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Large gzipped files are a common stumbling block for new users (SPARK-5685, 
> SPARK-28366) and an ongoing pain point for users who must process such files 
> delivered from external parties who can't or won't break them up into smaller 
> files or compress them using a splittable compression format like bzip2.
> To deal with large gzipped files today, users must either load them via a 
> single task and then repartition the resulting RDD or DataFrame, or they must 
> launch a preprocessing step outside of Spark to split up the file or 
> recompress it using a splittable format. In either case, the user needs a 
> single host capable of holding the entire decompressed file.
> Spark can potentially a) spare new users the confusion over why only one task 
> is processing their gzipped data, and b) relieve new and experienced users 
> alike from needing to maintain infrastructure capable of decompressing a 
> large gzipped file on a single node, by directly loading gzipped files into 
> multiple partitions across the cluster.
> The rough idea is to have tasks divide a given gzipped file into ranges and 
> then have them all concurrently decompress the file, with each task throwing 
> away the data leading up to the target range. (This kind of partial 
> decompression is apparently [doable using standard Unix 
> utilities|https://unix.stackexchange.com/a/415831/70630], so it should be 
> doable in Spark too.)
> In this way multiple tasks can concurrently load a single gzipped file into 
> multiple partitions. Even though every task will need to unpack the file from 
> the beginning to the task's target range, and the stage will run no faster 
> than what it would take with Spark's current gzip loading behavior, this 
> nonetheless addresses the two problems called out above. Users no longer need 
> to load and then repartition gzipped files, and their infrastructure does not 
> need to decompress any large gzipped file on a single node.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29166) Add a parameter to limit the number of dynamic partitions for data source table

2019-09-18 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-29166:
---
Description: 
Dynamic partition in Hive table has some restrictions to limit the max number 
of partitions. See 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-DynamicPartitionInserts

It's very useful to prevent creating partitions by mistake (for example, 
partitioning by an ID column). It also protects the NameNode from a flood of 
partition-creation RPC calls.

Data source tables also need a similar limitation.

  was:
Dynamic partition in Hive table has some restrictions to limit the max number 
of partitions. See 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-DynamicPartitionInserts

It's very useful to prevent to create mistake partitions like ID. Also it can 
protect the NameNode from mass RPC calls of creating.


> Add a parameter to limit the number of dynamic partitions for data source 
> table
> ---
>
> Key: SPARK-29166
> URL: https://issues.apache.org/jira/browse/SPARK-29166
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Dynamic partition in Hive table has some restrictions to limit the max number 
> of partitions. See 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-DynamicPartitionInserts
> It's very useful to prevent to create mistake partitions like ID. Also it can 
> protect the NameNode from mass RPC calls of creating.
> Data source table also needs similar limitation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29102) Read gzipped file into multiple partitions without full gzip expansion on a single-node

2019-09-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932968#comment-16932968
 ] 

Hyukjin Kwon commented on SPARK-29102:
--

Yea, that _might_ work. It's been too long since I investigated that so I don't 
even remember if I actually tested or not.
BTW, {{nl.basjes.hadoop.io.compress.SplittableGzipCodec}} actually decompresses 
the same files multiple times duplicately.

It duplicatedly decompresses and each map task processes what they want. And 
then, each map task stops decompressing if they processes what they one.
So. theorically the map tasks that has to process the last block of the gzipped 
file has to decompress whole file.

Seems like there's performance advantage nevertheless.

Another way:

{quote}
 it has to make an index after scanning once first
{quote}

To fully allow partial decompress (IIRC .. a long ago, I worked on this way), 
it has to make a separate index.
I tried to: scan once first, make a separate index file and decompress it 
partially but IIRC performance was poor (as much as 
{{nl.basjes.hadoop.io.compress.SplittableGzipCodec}} _IIRC_).

> Read gzipped file into multiple partitions without full gzip expansion on a 
> single-node
> ---
>
> Key: SPARK-29102
> URL: https://issues.apache.org/jira/browse/SPARK-29102
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Large gzipped files are a common stumbling block for new users (SPARK-5685, 
> SPARK-28366) and an ongoing pain point for users who must process such files 
> delivered from external parties who can't or won't break them up into smaller 
> files or compress them using a splittable compression format like bzip2.
> To deal with large gzipped files today, users must either load them via a 
> single task and then repartition the resulting RDD or DataFrame, or they must 
> launch a preprocessing step outside of Spark to split up the file or 
> recompress it using a splittable format. In either case, the user needs a 
> single host capable of holding the entire decompressed file.
> Spark can potentially a) spare new users the confusion over why only one task 
> is processing their gzipped data, and b) relieve new and experienced users 
> alike from needing to maintain infrastructure capable of decompressing a 
> large gzipped file on a single node, by directly loading gzipped files into 
> multiple partitions across the cluster.
> The rough idea is to have tasks divide a given gzipped file into ranges and 
> then have them all concurrently decompress the file, with each task throwing 
> away the data leading up to the target range. (This kind of partial 
> decompression is apparently [doable using standard Unix 
> utilities|https://unix.stackexchange.com/a/415831/70630], so it should be 
> doable in Spark too.)
> In this way multiple tasks can concurrently load a single gzipped file into 
> multiple partitions. Even though every task will need to unpack the file from 
> the beginning to the task's target range, and the stage will run no faster 
> than what it would take with Spark's current gzip loading behavior, this 
> nonetheless addresses the two problems called out above. Users no longer need 
> to load and then repartition gzipped files, and their infrastructure does not 
> need to decompress any large gzipped file on a single node.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29166) Add a parameter to limit the number of dynamic partitions for data source table

2019-09-18 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-29166:
--

 Summary: Add a parameter to limit the number of dynamic partitions 
for data source table
 Key: SPARK-29166
 URL: https://issues.apache.org/jira/browse/SPARK-29166
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4, 3.0.0
Reporter: Lantao Jin


Dynamic partition in Hive table has some restrictions to limit the max number 
of partitions. See 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-DynamicPartitionInserts

It's very useful to prevent creating partitions by mistake (for example, 
partitioning by an ID column). It also protects the NameNode from a flood of 
partition-creation RPC calls.
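
For context, a hedged sketch of the existing Hive-side limits referenced above, 
as they would be set from a Spark SQL session; the data source table parameter 
proposed in this ticket does not exist yet, and the values are arbitrary 
examples.

{code}
// Assumes an existing SparkSession `spark` with Hive support enabled.
// These are existing Hive settings, shown only to illustrate the kind of limit
// this ticket proposes for data source tables.
spark.sql("SET hive.exec.max.dynamic.partitions=1000")
spark.sql("SET hive.exec.max.dynamic.partitions.pernode=100")
{code}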



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28683) Upgrade Scala to 2.12.10

2019-09-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28683:
--
Fix Version/s: 2.4.5

> Upgrade Scala to 2.12.10
> 
>
> Key: SPARK-28683
> URL: https://issues.apache.org/jira/browse/SPARK-28683
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> *Note that we tested 2.12.9 via https://github.com/apache/spark/pull/25404 
> and found that 2.12.9 has a serious bug, 
> https://github.com/scala/bug/issues/11665 *
> We will skip 2.12.9 and try to upgrade 2.12.10 directly in this PR.
> h3. Highlights (2.12.9)
>  * Faster compiler: [5–10% faster since 
> 2.12.8|https://scala-ci.typesafe.com/grafana/dashboard/db/scala-benchmark?orgId=1=1543097847070=1564631199344=2.12.x=All=HotScalacBenchmark.compile=scalabench%40scalabench%40],
>  thanks to many optimizations (mostly by Jason Zaugg and Diego E. 
> Alonso-Blas: kudos!)
>  * Improved compatibility with JDK 11, 12, and 13
>  * Experimental support for build pipelining and outline type checking
> [https://github.com/scala/scala/releases/tag/v2.12.9]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28683) Upgrade Scala to 2.12.10

2019-09-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932962#comment-16932962
 ] 

Dongjoon Hyun commented on SPARK-28683:
---

This is backported to `branch-2.4` for Apache Spark 2.4.5 via 
https://github.com/apache/spark/pull/25839

> Upgrade Scala to 2.12.10
> 
>
> Key: SPARK-28683
> URL: https://issues.apache.org/jira/browse/SPARK-28683
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> *Note that we tested 2.12.9 via https://github.com/apache/spark/pull/25404 
> and found that 2.12.9 has a serious bug, 
> https://github.com/scala/bug/issues/11665 *
> We will skip 2.12.9 and try to upgrade 2.12.10 directly in this PR.
> h3. Highlights (2.12.9)
>  * Faster compiler: [5–10% faster since 
> 2.12.8|https://scala-ci.typesafe.com/grafana/dashboard/db/scala-benchmark?orgId=1=1543097847070=1564631199344=2.12.x=All=HotScalacBenchmark.compile=scalabench%40scalabench%40],
>  thanks to many optimizations (mostly by Jason Zaugg and Diego E. 
> Alonso-Blas: kudos!)
>  * Improved compatibility with JDK 11, 12, and 13
>  * Experimental support for build pipelining and outline type checking
> [https://github.com/scala/scala/releases/tag/v2.12.9]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28683) Upgrade Scala to 2.12.10

2019-09-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28683:
--
Affects Version/s: 2.4.5

> Upgrade Scala to 2.12.10
> 
>
> Key: SPARK-28683
> URL: https://issues.apache.org/jira/browse/SPARK-28683
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> *Note that we tested 2.12.9 via https://github.com/apache/spark/pull/25404 
> and found that 2.12.9 has a serious bug, 
> https://github.com/scala/bug/issues/11665 *
> We will skip 2.12.9 and try to upgrade 2.12.10 directly in this PR.
> h3. Highlights (2.12.9)
>  * Faster compiler: [5–10% faster since 
> 2.12.8|https://scala-ci.typesafe.com/grafana/dashboard/db/scala-benchmark?orgId=1=1543097847070=1564631199344=2.12.x=All=HotScalacBenchmark.compile=scalabench%40scalabench%40],
>  thanks to many optimizations (mostly by Jason Zaugg and Diego E. 
> Alonso-Blas: kudos!)
>  * Improved compatibility with JDK 11, 12, and 13
>  * Experimental support for build pipelining and outline type checking
> [https://github.com/scala/scala/releases/tag/v2.12.9]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29141) Use SqlBasedBenchmark in SQL benchmarks

2019-09-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29141.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25828
[https://github.com/apache/spark/pull/25828]

> Use SqlBasedBenchmark in SQL benchmarks
> ---
>
> Key: SPARK-29141
> URL: https://issues.apache.org/jira/browse/SPARK-29141
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> The ticket was created in response to [~dongjoon]'s comment: 
> https://github.com/apache/spark/pull/25772#discussion_r323891916 . The purpose 
> is to have all SQL-related benchmarks extend the single trait SqlBasedBenchmark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29165) Set log level of log generated code as ERROR in case of compile error on generated code in UT

2019-09-18 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-29165:


 Summary: Set log level of log generated code as ERROR in case of 
compile error on generated code in UT
 Key: SPARK-29165
 URL: https://issues.apache.org/jira/browse/SPARK-29165
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


This would make it easier to investigate compilation issues in generated code: 
currently we get an exception message pointing at a line number, but the 
generated code itself is never actually logged (since in most UTs the log level 
threshold is at least WARN).

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29102) Read gzipped file into multiple partitions without full gzip expansion on a single-node

2019-09-18 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932953#comment-16932953
 ] 

Nicholas Chammas commented on SPARK-29102:
--

Ah, thanks for the reference! So if I'm just trying to read gzipped CSV or JSON 
text files, then  {{nl.basjes.hadoop.io.compress.SplittableGzipCodec}} may 
already provide a solution today, correct?

> Read gzipped file into multiple partitions without full gzip expansion on a 
> single-node
> ---
>
> Key: SPARK-29102
> URL: https://issues.apache.org/jira/browse/SPARK-29102
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Large gzipped files are a common stumbling block for new users (SPARK-5685, 
> SPARK-28366) and an ongoing pain point for users who must process such files 
> delivered from external parties who can't or won't break them up into smaller 
> files or compress them using a splittable compression format like bzip2.
> To deal with large gzipped files today, users must either load them via a 
> single task and then repartition the resulting RDD or DataFrame, or they must 
> launch a preprocessing step outside of Spark to split up the file or 
> recompress it using a splittable format. In either case, the user needs a 
> single host capable of holding the entire decompressed file.
> Spark can potentially a) spare new users the confusion over why only one task 
> is processing their gzipped data, and b) relieve new and experienced users 
> alike from needing to maintain infrastructure capable of decompressing a 
> large gzipped file on a single node, by directly loading gzipped files into 
> multiple partitions across the cluster.
> The rough idea is to have tasks divide a given gzipped file into ranges and 
> then have them all concurrently decompress the file, with each task throwing 
> away the data leading up to the target range. (This kind of partial 
> decompression is apparently [doable using standard Unix 
> utilities|https://unix.stackexchange.com/a/415831/70630], so it should be 
> doable in Spark too.)
> In this way multiple tasks can concurrently load a single gzipped file into 
> multiple partitions. Even though every task will need to unpack the file from 
> the beginning to the task's target range, and the stage will run no faster 
> than what it would take with Spark's current gzip loading behavior, this 
> nonetheless addresses the two problems called out above. Users no longer need 
> to load and then repartition gzipped files, and their infrastructure does not 
> need to decompress any large gzipped file on a single node.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29157) DataSourceV2: Add DataFrameWriterV2 to Python API

2019-09-18 Thread Terry Kim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Terry Kim updated SPARK-29157:
--
Description: *strong text*After SPARK-28612 is committed, we need to add 
the corresponding PySpark API.  (was: After SPARK-28612 is committed, we need 
to add the corresponding PySpark API.)

> DataSourceV2: Add DataFrameWriterV2 to Python API
> -
>
> Key: SPARK-29157
> URL: https://issues.apache.org/jira/browse/SPARK-29157
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>
> *strong text*After SPARK-28612 is committed, we need to add the corresponding 
> PySpark API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29157) DataSourceV2: Add DataFrameWriterV2 to Python API

2019-09-18 Thread Terry Kim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Terry Kim updated SPARK-29157:
--
Description: After SPARK-28612 is committed, we need to add the 
corresponding PySpark API.  (was: *strong text*After SPARK-28612 is committed, 
we need to add the corresponding PySpark API.)

> DataSourceV2: Add DataFrameWriterV2 to Python API
> -
>
> Key: SPARK-29157
> URL: https://issues.apache.org/jira/browse/SPARK-29157
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Major
>
> After SPARK-28612 is committed, we need to add the corresponding PySpark API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29162) Simplify NOT(isnull(x)) and NOT(isnotnull(x))

2019-09-18 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-29162:
---
Description: 
I propose the following expression rewrite optimizations:

{code}
NOT isnull(x) -> isnotnull(x)
NOT isnotnull(x)  -> isnull(x)
{code}

This might seem contrived, but I saw negated versions of these expressions 
appear in a user-written query after that query had undergone optimization. For 
example:

{code}
spark.createDataset(Seq[(String, java.lang.Boolean)](("true", true), ("false", 
false), ("null", null))).write.parquet("/tmp/bools")
spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == false)").explain

spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == 
false)").explain(true)
== Parsed Logical Plan ==
'Filter NOT ('isnull('_2) OR ('_2 = false))
+- RelationV2[_1#4, _2#5] parquet file:/tmp/bools

== Analyzed Logical Plan ==
_1: string, _2: boolean
Filter NOT (isnull(_2#5) OR (_2#5 = false))
+- RelationV2[_1#4, _2#5] parquet file:/tmp/bools

== Optimized Logical Plan ==
Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false))
+- RelationV2[_1#4, _2#5] parquet file:/tmp/bools

== Physical Plan ==
*(1) Project [_1#4, _2#5]
+- *(1) Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false))
   +- *(1) ColumnarToRow
  +- BatchScan[_1#4, _2#5] ParquetScan Location: 
InMemoryFileIndex[file:/tmp/bools], ReadSchema: struct<_1:string,_2:boolean>
{code}

This rewrite is also useful for query canonicalization.

  was:
I propose the following expression rewrite optimizations:

{code}
NOT isnull(x) -> isnotnull(x)
NOT isnotnull(x)  -> isnull(x)
{code}

This might seem contrived, but I saw negated versions of these expressions 
appear in a user-written query after that query had undergone optimization. For 
example:

{code}
spark.createDataset(Seq[(String, java.lang.Boolean)](("true", true), ("false", 
false), ("null", null))).write.parquet("/tmp/bools")
spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == false)").explain

spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == 
false)").explain(true)
== Parsed Logical Plan ==
'Filter NOT ('isnull('_2) OR ('_2 = false))
+- RelationV2[_1#4, _2#5] parquet file:/tmp/bools

== Analyzed Logical Plan ==
_1: string, _2: boolean
Filter NOT (isnull(_2#5) OR (_2#5 = false))
+- RelationV2[_1#4, _2#5] parquet file:/tmp/bools

== Optimized Logical Plan ==
Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false))
+- RelationV2[_1#4, _2#5] parquet file:/tmp/bools

== Physical Plan ==
*(1) Project [_1#4, _2#5]
+- *(1) Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false))
   +- *(1) ColumnarToRow
  +- BatchScan[_1#4, _2#5] ParquetScan Location: 
InMemoryFileIndex[file:/tmp/bools], ReadSchema: struct<_1:string,_2:boolean>
{code}


> Simplify NOT(isnull(x)) and NOT(isnotnull(x))
> -
>
> Key: SPARK-29162
> URL: https://issues.apache.org/jira/browse/SPARK-29162
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Major
>
> I propose the following expression rewrite optimizations:
> {code}
> NOT isnull(x) -> isnotnull(x)
> NOT isnotnull(x)  -> isnull(x)
> {code}
> This might seem contrived, but I saw negated versions of these expressions 
> appear in a user-written query after that query had undergone optimization. 
> For example:
> {code}
> spark.createDataset(Seq[(String, java.lang.Boolean)](("true", true), 
> ("false", false), ("null", null))).write.parquet("/tmp/bools")
> spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == 
> false)").explain
> spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == 
> false)").explain(true)
> == Parsed Logical Plan ==
> 'Filter NOT ('isnull('_2) OR ('_2 = false))
> +- RelationV2[_1#4, _2#5] parquet file:/tmp/bools
> == Analyzed Logical Plan ==
> _1: string, _2: boolean
> Filter NOT (isnull(_2#5) OR (_2#5 = false))
> +- RelationV2[_1#4, _2#5] parquet file:/tmp/bools
> == Optimized Logical Plan ==
> Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false))
> +- RelationV2[_1#4, _2#5] parquet file:/tmp/bools
> == Physical Plan ==
> *(1) Project [_1#4, _2#5]
> +- *(1) Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false))
>+- *(1) ColumnarToRow
>   +- BatchScan[_1#4, _2#5] ParquetScan Location: 
> InMemoryFileIndex[file:/tmp/bools], ReadSchema: struct<_1:string,_2:boolean>
> {code}
> This rewrite is also useful for query canonicalization.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-29164) Rewrite coalesce(boolean, booleanLit) as boolean expression

2019-09-18 Thread Josh Rosen (Jira)
Josh Rosen created SPARK-29164:
--

 Summary: Rewrite coalesce(boolean, booleanLit) as boolean 
expression
 Key: SPARK-29164
 URL: https://issues.apache.org/jira/browse/SPARK-29164
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Josh Rosen


I propose the following expression rewrite optimizations:
{code:java}
coalesce(x: Boolean, true)  -> x or isnull(x)
coalesce(x: Boolean, false) -> x and isnotnull(x){code}
This pattern appears when translating Dataset filters on {{Option[Boolean]}} 
columns: we might have a typed Dataset filter which looks like
{code:java}
 .filter(_.boolCol.getOrElse(DEFAULT_VALUE)){code}
and the most idiomatic, user-friendly translation of this in Catalyst is to use 
{{coalesce()}}. However, the {{coalesce()}} form of this expression is not 
eligible for Parquet / data source filter pushdown.
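
A small self-contained sketch of the pattern being described; the case class and 
column names are made up, and the last filter shows the proposed equivalent form 
rather than anything Spark produces today:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, isnull, lit}

case class Event(id: Long, boolCol: Option[Boolean])

val spark = SparkSession.builder().master("local[*]").appName("coalesce-rewrite").getOrCreate()
import spark.implicits._

val ds = Seq(Event(1L, Some(true)), Event(2L, Some(false)), Event(3L, None)).toDS()

// Typed filter: getOrElse(true) means "keep rows unless boolCol is false".
val typed = ds.filter(_.boolCol.getOrElse(true))

// The idiomatic untyped translation uses coalesce(), which is not pushdown-friendly.
val viaCoalesce = ds.filter(coalesce(col("boolCol"), lit(true)))

// Proposed equivalent form, expressible as simple data source filters.
val viaRewrite = ds.filter(col("boolCol") || isnull(col("boolCol")))

viaCoalesce.explain(true)
viaRewrite.explain(true)
{code}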



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29163) Provide a mixin to simplify HadoopConf access patterns in DataSource V2

2019-09-18 Thread holdenk (Jira)
holdenk created SPARK-29163:
---

 Summary: Provide a mixin to simplify HadoopConf access patterns in 
DataSource V2
 Key: SPARK-29163
 URL: https://issues.apache.org/jira/browse/SPARK-29163
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: holdenk
Assignee: holdenk


Since many data sources need the Hadoop config, we should provide an easy way for 
them to get access to it with minimal overhead (e.g. broadcasting + a mixin).

 

TODO after SPARK-29158. Also look at DSV1 and see if there were any interesting 
hacks we did before to make this fast.
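
A hedged sketch of what such a mixin might look like; the trait name and shape are 
assumptions, and it presumes SerializableConfiguration becomes publicly usable 
(SPARK-29158), since today it is private[spark]:

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.SerializableConfiguration

trait BroadcastedHadoopConf {
  // Driver side, during planning: broadcast the Hadoop configuration once and
  // capture the returned handle in the serializable reader/writer factory.
  protected def broadcastHadoopConf(
      session: SparkSession): Broadcast[SerializableConfiguration] = {
    val conf: Configuration = session.sparkContext.hadoopConfiguration
    session.sparkContext.broadcast(new SerializableConfiguration(conf))
  }
}

// Executor side: a factory that captured `bc` recovers the Configuration with
// bc.value.value.
{code}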



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29162) Simplify NOT(isnull(x)) and NOT(isnotnull(x))

2019-09-18 Thread Josh Rosen (Jira)
Josh Rosen created SPARK-29162:
--

 Summary: Simplify NOT(isnull(x)) and NOT(isnotnull(x))
 Key: SPARK-29162
 URL: https://issues.apache.org/jira/browse/SPARK-29162
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Josh Rosen


I propose the following expression rewrite optimizations:

{code}
NOT isnull(x) -> isnotnull(x)
NOT isnotnull(x)  -> isnull(x)
{code}

This might seem contrived, but I saw negated versions of these expressions 
appear in a user-written query after that query had undergone optimization. For 
example:

{code}
spark.createDataset(Seq[(String, java.lang.Boolean)](("true", true), ("false", 
false), ("null", null))).write.parquet("/tmp/bools")
spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == false)").explain

spark.read.parquet("/tmp/bools").where("not(isnull(_2) or _2 == 
false)").explain(true)
== Parsed Logical Plan ==
'Filter NOT ('isnull('_2) OR ('_2 = false))
+- RelationV2[_1#4, _2#5] parquet file:/tmp/bools

== Analyzed Logical Plan ==
_1: string, _2: boolean
Filter NOT (isnull(_2#5) OR (_2#5 = false))
+- RelationV2[_1#4, _2#5] parquet file:/tmp/bools

== Optimized Logical Plan ==
Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false))
+- RelationV2[_1#4, _2#5] parquet file:/tmp/bools

== Physical Plan ==
*(1) Project [_1#4, _2#5]
+- *(1) Filter ((isnotnull(_2#5) AND NOT isnull(_2#5)) AND NOT (_2#5 = false))
   +- *(1) ColumnarToRow
  +- BatchScan[_1#4, _2#5] ParquetScan Location: 
InMemoryFileIndex[file:/tmp/bools], ReadSchema: struct<_1:string,_2:boolean>
{code}
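
For illustration, a sketch of what such a rewrite could look like as a Catalyst rule. 
The rule name and its placement in the optimizer are assumptions for this sketch, not 
the actual change:

{code:java}
import org.apache.spark.sql.catalyst.expressions.{IsNotNull, IsNull, Not}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

object SimplifyNegatedNullChecks extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Not(IsNull(child))    => IsNotNull(child)   // NOT isnull(x)    -> isnotnull(x)
    case Not(IsNotNull(child)) => IsNull(child)      // NOT isnotnull(x) -> isnull(x)
  }
}
{code}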



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29161) Unify default wait time for LiveListenerBus.waitUntilEmpty

2019-09-18 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-29161:


 Summary: Unify default wait time for LiveListenerBus.waitUntilEmpty
 Key: SPARK-29161
 URL: https://issues.apache.org/jira/browse/SPARK-29161
 Project: Spark
  Issue Type: Improvement
  Components: DStreams, Spark Core, SQL, Tests
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


This issue tracks the effort of following up [review 
comment|https://github.com/apache/spark/pull/25706#discussion_r321923311]. 
Quoting here:

 {quote}
On a side note, the timeout for this method is hardcoded to a bunch of 
different arbitrary values in so many different places, that it may be good at 
some point to just have a default value in LiveListenerBus. I doubt any test 
code actually depends on a specific timeout here.
{quote}
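
A sketch of the direction suggested in the quote; the names and the value below are 
placeholders, not a decided API:

{code:java}
// One shared default instead of per-call-site magic numbers in tests.
private[spark] object ListenerBusDefaults {
  val WaitTimeoutMillis: Long = 10 * 1000L
}

// In LiveListenerBus, an overload that test code can call without arguments:
// private[spark] def waitUntilEmpty(): Unit =
//   waitUntilEmpty(ListenerBusDefaults.WaitTimeoutMillis)
{code}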



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29078) Spark shell fails if read permission is not granted to hive warehouse directory

2019-09-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932923#comment-16932923
 ] 

Hyukjin Kwon commented on SPARK-29078:
--

ping [~misutoth]

> Spark shell fails if read permission is not granted to hive warehouse 
> directory
> ---
>
> Key: SPARK-29078
> URL: https://issues.apache.org/jira/browse/SPARK-29078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mihaly Toth
>Priority: Major
>
> Similarly to SPARK-20256, when {{GlobalTempViewManager}} is created in 
> {{SharedSessionState}}, it is checked that no database exists with the same 
> name as the global temp database (the name is configurable with 
> {{spark.sql.globalTempDatabase}}), because that is a special database which 
> should not exist in the metastore. At the moment this check requires read 
> permission on the warehouse directory, which on the other hand would allow 
> listing all the databases of all users.
> When such read access is not granted for security reasons, an access 
> violation exception should be ignored during this initial validation.
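
A rough sketch of the proposed behaviour, illustrative only; the method name is made 
up and the exact exception type that reaches this check may differ depending on the 
metastore client:

{code:java}
import org.apache.hadoop.security.AccessControlException
import org.apache.spark.sql.catalyst.catalog.ExternalCatalog

def checkGlobalTempDbDoesNotExist(catalog: ExternalCatalog, globalTempDb: String): Unit = {
  try {
    require(!catalog.databaseExists(globalTempDb),
      s"$globalTempDb is a reserved database name; please drop or rename it")
  } catch {
    case _: AccessControlException =>
      // No read permission on the warehouse directory: skip this best-effort
      // check instead of failing session initialization.
  }
}
{code}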



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29099) org.apache.spark.sql.catalyst.catalog.CatalogTable.lastAccessTime is not set

2019-09-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932922#comment-16932922
 ] 

Hyukjin Kwon commented on SPARK-29099:
--

Seems like we use "UNKNOWN" in that case:

1. 
https://github.com/apache/spark/commit/d4a277f0ce2d6e1832d87cae8faec38c5bc730f4
2. 
https://github.com/apache/spark/commit/4559a82a1de289093064490ef2d39c3c535fb3d4

> org.apache.spark.sql.catalyst.catalog.CatalogTable.lastAccessTime is not set
> 
>
> Key: SPARK-29099
> URL: https://issues.apache.org/jira/browse/SPARK-29099
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Shixiong Zhu
>Priority: Major
>
> I noticed that 
> "org.apache.spark.sql.catalyst.catalog.CatalogTable.lastAccessTime" is always 
> 0 in my environment. Looks like Spark never updates this field in metastore 
> when reading a table. This is fine considering the cost to update it when 
> reading a table is high.
> However, "Last Access" in "describe extended" always shows "Thu Jan 01 
> 00:00:00 UTC 1970" and this is confusing. Can we show something alternative 
> to indicate it's not set?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29102) Read gzipped file into multiple partitions without full gzip expansion on a single-node

2019-09-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932921#comment-16932921
 ] 

Hyukjin Kwon commented on SPARK-29102:
--

Hm, I took a look at this one a few years ago and it was pretty difficult to do. To 
allow partial decompression, IIRC, it either

1. has to build an index after scanning the file once first, or
2. has to decompress the leading data redundantly (see https://github.com/nielsbasjes/splittablegzip)

I think this is already possible if the codec is properly registered in the 
Hadoop/HDFS configuration and the {{compression}} option is set to, say, 
{{nl.basjes.hadoop.io.compress.SplittableGzipCodec}}.
For other file formats like Parquet or ORC, they have their own compression 
codecs, so if those don't support it, we can't do it.
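
A possible way to try that route from Spark, as a rough sketch only: it assumes the 
splittable-gzip jar is on the driver and executor classpath and that registering the 
codec this way takes precedence over the built-in GzipCodec for the .gz extension, 
which should be verified before relying on it:

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("splittable-gzip-read")
  // Register the codec in the Hadoop configuration that Spark hands to the readers.
  .config("spark.hadoop.io.compression.codecs",
    "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
  // Optionally cap the split size so a single .gz file yields several splits.
  .config("spark.hadoop.mapreduce.input.fileinputformat.split.maxsize",
    (128L * 1024 * 1024).toString)
  .getOrCreate()

// Each split decompresses from the start of the file and discards the leading
// bytes, trading duplicated CPU work for parallelism.
val df = spark.read.text("/data/big.gz")
println(df.rdd.getNumPartitions)
{code}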

> Read gzipped file into multiple partitions without full gzip expansion on a 
> single-node
> ---
>
> Key: SPARK-29102
> URL: https://issues.apache.org/jira/browse/SPARK-29102
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Large gzipped files are a common stumbling block for new users (SPARK-5685, 
> SPARK-28366) and an ongoing pain point for users who must process such files 
> delivered from external parties who can't or won't break them up into smaller 
> files or compress them using a splittable compression format like bzip2.
> To deal with large gzipped files today, users must either load them via a 
> single task and then repartition the resulting RDD or DataFrame, or they must 
> launch a preprocessing step outside of Spark to split up the file or 
> recompress it using a splittable format. In either case, the user needs a 
> single host capable of holding the entire decompressed file.
> Spark can potentially a) spare new users the confusion over why only one task 
> is processing their gzipped data, and b) relieve new and experienced users 
> alike from needing to maintain infrastructure capable of decompressing a 
> large gzipped file on a single node, by directly loading gzipped files into 
> multiple partitions across the cluster.
> The rough idea is to have tasks divide a given gzipped file into ranges and 
> then have them all concurrently decompress the file, with each task throwing 
> away the data leading up to the target range. (This kind of partial 
> decompression is apparently [doable using standard Unix 
> utilities|https://unix.stackexchange.com/a/415831/70630], so it should be 
> doable in Spark too.)
> In this way multiple tasks can concurrently load a single gzipped file into 
> multiple partitions. Even though every task will need to unpack the file from 
> the beginning to the task's target range, and the stage will run no faster 
> than what it would take with Spark's current gzip loading behavior, this 
> nonetheless addresses the two problems called out above. Users no longer need 
> to load and then repartition gzipped files, and their infrastructure does not 
> need to decompress any large gzipped file on a single node.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29145) Spark SQL cannot handle "NOT IN" condition when using "JOIN"

2019-09-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29145:
-
Description: 
sample sql: 

{code}
spark.range(10).createOrReplaceTempView("A")
spark.range(10).createOrReplaceTempView("B")
spark.range(10).createOrReplaceTempView("C")
sql("""select * from A inner join B on A.id=B.id and A.id NOT IN (select id 
from C)""")
{code}
 
{code}
org.apache.spark.sql.AnalysisException: Table or view not found: C; line 1 pos 
74;
'Project [*]
+- 'Join Inner, ((id#0L = id#2L) AND NOT id#0L IN (list#6 []))
   :  +- 'Project ['id]
   : +- 'UnresolvedRelation [C]
   :- SubqueryAlias `a`
   :  +- Range (0, 10, step=1, splits=Some(12))
   +- SubqueryAlias `b`
  +- Range (0, 10, step=1, splits=Some(12))

  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:94)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:89)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:155)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:154)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:154)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:154)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:89)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:86)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:120)
...
{code}
 

  was:
sample sql: 

select * from A inner join B on A.id=B.id and A.id NOT IN (select id from C)

Spark SQL throws an exception: table or view `C` not found

 

 


> Spark SQL cannot handle "NOT IN" condition when using "JOIN"
> 
>
> Key: SPARK-29145
> URL: https://issues.apache.org/jira/browse/SPARK-29145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Dezhi Cai
>Priority: Minor
>
> sample sql: 
> {code}
> spark.range(10).createOrReplaceTempView("A")
> spark.range(10).createOrReplaceTempView("B")
> spark.range(10).createOrReplaceTempView("C")
> sql("""select * from A inner join B on A.id=B.id and A.id NOT IN (select id 
> from C)""")
> {code}
>  
> {code}
> org.apache.spark.sql.AnalysisException: Table or view not found: C; line 1 
> pos 74;
> 'Project [*]
> +- 'Join Inner, ((id#0L = id#2L) AND NOT id#0L IN (list#6 []))
>:  +- 'Project ['id]
>: +- 'UnresolvedRelation [C]
>:- SubqueryAlias `a`
>:  +- Range (0, 10, step=1, splits=Some(12))
>+- SubqueryAlias `b`
>   +- Range (0, 10, step=1, splits=Some(12))
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:94)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:89)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:155)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:154)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:154)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:89)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:86)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:120)
> ...
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29147) Spark doesn't use shuffleHashJoin as expected

2019-09-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932915#comment-16932915
 ] 

Hyukjin Kwon commented on SPARK-29147:
--

[~ayudovin], let's interact with the mailing list first before filing an issue. 
Questions should go there.

> Spark doesn't use shuffleHashJoin as expected
> -
>
> Key: SPARK-29147
> URL: https://issues.apache.org/jira/browse/SPARK-29147
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core, SQL
>Affects Versions: 2.4.3, 2.4.4
>Reporter: Artsiom Yudovin
>Priority: Major
>
> I run the following code:
> {code:java}
> val spark = SparkSession.builder()
>   .appName("ShuffleHashJoin")
>   .master("local[*]")
>   .config("spark.sql.autoBroadcastJoinThreshold", 0)
>   .config("spark.sql.join.preferSortMergeJoin", value = false)
>   .getOrCreate()
> import spark.implicits._
> val dataset = Seq(
>   ("1", "playing"),
>   ("2", "with"),
>   ("3", "ShuffledHashJoinExec")
> ).toDF("id", "token")
> val dataset1 = Seq(
>   ("1", "playing"),
>   ("2", "with"),
>   ("3", "ShuffledHashJoinExec")
> ).toDF("id1", "token")
>   
>dataset.join(dataset1, $"id" === $"id1", "inner").foreach(t => println(t))
> {code}
> My expectation is that Spark will use 'shuffleHashJoin', but I see in the Spark 
> UI and in explain() that Spark uses 'sortMergeJoin'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29146) 'DataFrame' object has no attribute 'copy'

2019-09-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932917#comment-16932917
 ] 

Hyukjin Kwon commented on SPARK-29146:
--

Can you show the reproducer please?

> 'DataFrame' object has no attribute 'copy'
> --
>
> Key: SPARK-29146
> URL: https://issues.apache.org/jira/browse/SPARK-29146
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.3
>Reporter: Gourav Mehta
>Priority: Major
>
> A 'DataFrame' object has no attribute 'copy' error occurs while executing the 
> code (cvModel = crossval.fit(trainingData))
>  
> !image-2019-09-18-16-16-07-690.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29147) Spark doesn't use shuffleHashJoin as expected

2019-09-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29147.
--
Resolution: Invalid

> Spark doesn't use shuffleHashJoin as expected
> -
>
> Key: SPARK-29147
> URL: https://issues.apache.org/jira/browse/SPARK-29147
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core, SQL
>Affects Versions: 2.4.3, 2.4.4
>Reporter: Artsiom Yudovin
>Priority: Major
>
> I run the following code:
> {code:java}
> val spark = SparkSession.builder()
>   .appName("ShuffleHashJoin")
>   .master("local[*]")
>   .config("spark.sql.autoBroadcastJoinThreshold", 0)
>   .config("spark.sql.join.preferSortMergeJoin", value = false)
>   .getOrCreate()
> import spark.implicits._
> val dataset = Seq(
>   ("1", "playing"),
>   ("2", "with"),
>   ("3", "ShuffledHashJoinExec")
> ).toDF("id", "token")
> val dataset1 = Seq(
>   ("1", "playing"),
>   ("2", "with"),
>   ("3", "ShuffledHashJoinExec")
> ).toDF("id1", "token")
>   
>dataset.join(dataset1, $"id" === $"id1", "inner").foreach(t => println(t))
> {code}
> My expectation is that Spark will use 'shuffleHashJoin', but I see in the Spark 
> UI and in explain() that Spark uses 'sortMergeJoin'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29147) Spark doesn't use shuffleHashJoin as expected

2019-09-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29147:
-
Priority: Major  (was: Critical)

> Spark doesn't use shuffleHashJoin as expected
> -
>
> Key: SPARK-29147
> URL: https://issues.apache.org/jira/browse/SPARK-29147
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core, SQL
>Affects Versions: 2.4.3, 2.4.4
>Reporter: Artsiom Yudovin
>Priority: Major
>
> I run the following code:
> {code:java}
> val spark = SparkSession.builder()
>   .appName("ShuffleHashJoin")
>   .master("local[*]")
>   .config("spark.sql.autoBroadcastJoinThreshold", 0)
>   .config("spark.sql.join.preferSortMergeJoin", value = false)
>   .getOrCreate()
> import spark.implicits._
> val dataset = Seq(
>   ("1", "playing"),
>   ("2", "with"),
>   ("3", "ShuffledHashJoinExec")
> ).toDF("id", "token")
> val dataset1 = Seq(
>   ("1", "playing"),
>   ("2", "with"),
>   ("3", "ShuffledHashJoinExec")
> ).toDF("id1", "token")
>   
>dataset.join(dataset1, $"id" === $"id1", "inner").foreach(t => println(t))
> {code}
> My expectation is that Spark will use 'shuffleHashJoin', but I see in the Spark 
> UI and in explain() that Spark uses 'sortMergeJoin'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29156) Hive has appending data as part of cdc, In write mode we should be able to write only changes captured to teradata or datasource.

2019-09-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932913#comment-16932913
 ] 

Hyukjin Kwon commented on SPARK-29156:
--

Can you clarify it and show the reproducer please? I cannot fully understand 
what this JIRA means.

> Hive has appending data as part of cdc, In write mode we should be able to 
> write only changes captured to teradata or datasource.
> -
>
> Key: SPARK-29156
> URL: https://issues.apache.org/jira/browse/SPARK-29156
> Project: Spark
>  Issue Type: New Feature
>  Components: Tests
>Affects Versions: 2.4.3
> Environment: spark 2.3.2
> dataiku
> aws emr
>Reporter: raju
>Priority: Major
>  Labels: patch
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> In general, change data captures (CDC) are appended to Hive tables. We have a 
> scenario where we connect to Teradata or another data source, and only the 
> changes captured as updates should be written to that data source. We are 
> unable to do this with the overwrite and append modes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29156) Hive has appending data as part of cdc, In write mode we should be able to write only changes captured to teradata or datasource.

2019-09-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29156:
-
Target Version/s:   (was: 2.4.4)

> Hive has appending data as part of cdc, In write mode we should be able to 
> write only changes captured to teradata or datasource.
> -
>
> Key: SPARK-29156
> URL: https://issues.apache.org/jira/browse/SPARK-29156
> Project: Spark
>  Issue Type: New Feature
>  Components: Tests
>Affects Versions: 2.4.3
> Environment: spark 2.3.2
> dataiku
> aws emr
>Reporter: raju
>Priority: Major
>  Labels: patch
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> In general, change data captures (CDC) are appended to Hive tables. We have a 
> scenario where we connect to Teradata or another data source, and only the 
> changes captured as updates should be written to that data source. We are 
> unable to do this with the overwrite and append modes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29160) Event log file is written without specific charset which should be ideally UTF-8

2019-09-18 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932909#comment-16932909
 ] 

Jungtaek Lim commented on SPARK-29160:
--

I'll raise a patch today. It might need a config to "use the default charset" for 
end users who have trouble reading old logs, but then the writer would still have 
to use the default charset so that the reader can read both old and new logs, so 
it's somewhat messy. I won't apply this in the PR, but will add a comment instead.

> Event log file is written without specific charset which should be ideally 
> UTF-8
> 
>
> Key: SPARK-29160
> URL: https://issues.apache.org/jira/browse/SPARK-29160
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> This issue is from observation by [~vanzin] : 
> [https://github.com/apache/spark/pull/25670#discussion_r325383512]
> Quoting his comment here:
> {quote}
> This is a long standing bug in the original code, but this should be 
> explicitly setting the charset to UTF-8 (using new PrintWriter(new 
> OutputStreamWriter(...)).
> The reader side should too, although doing that now could potentially break 
> old logs... we should open a bug for this.
> {quote}
> While EventLoggingListener writes to UTF-8 properly when converting to byte[] 
> before writing, it doesn't deal with charset in logEvent().
> It should be fixed, but as Marcelo said, we also need to be aware of 
> potentially breaking the reading of old logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29160) Event log file is written without specific charset which should be ideally UTF-8

2019-09-18 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932901#comment-16932901
 ] 

Jungtaek Lim commented on SPARK-29160:
--

While I just added 3.0.0 as Affected Version, all versions we support might be 
affected.

> Event log file is written without specific charset which should be ideally 
> UTF-8
> 
>
> Key: SPARK-29160
> URL: https://issues.apache.org/jira/browse/SPARK-29160
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> This issue is from observation by [~vanzin] : 
> [https://github.com/apache/spark/pull/25670#discussion_r325383512]
> Quoting his comment here:
> {quote}
> This is a long standing bug in the original code, but this should be 
> explicitly setting the charset to UTF-8 (using new PrintWriter(new 
> OutputStreamWriter(...)).
> The reader side should too, although doing that now could potentially break 
> old logs... we should open a bug for this.
> {quote}
> While EventLoggingListener writes to UTF-8 properly when converting to byte[] 
> before writing, it doesn't deal with charset in logEvent().
> It should be fixed, but as Marcelo said, we also need to be aware of 
> potentially breaking the reading of old logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29160) Event log file is written without specific charset which should be ideally UTF-8

2019-09-18 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-29160:
-
Description: 
This issue is from observation by [~vanzin] : 
[https://github.com/apache/spark/pull/25670#discussion_r325383512]

Quoting his comment here:

{quote}
This is a long standing bug in the original code, but this should be explicitly 
setting the charset to UTF-8 (using new PrintWriter(new 
OutputStreamWriter(...)).

The reader side should too, although doing that now could potentially break old 
logs... we should open a bug for this.
{quote}

While EventLoggingListener writes to UTF-8 properly when converting to byte[] 
before writing, it doesn't deal with charset in logEvent().

It should be fixed, but as Marcelo said, we also need to be aware of potentially 
breaking the reading of old logs.

  was:
This issue is from observation by [~vanzin] : 
[https://github.com/apache/spark/pull/25670#discussion_r325383512]

Quoting his comment here:
{noformat}
This is a long standing bug in the original code, but this should be explicitly 
setting the charset to UTF-8 (using new PrintWriter(new 
OutputStreamWriter(...)).

The reader side should too, although doing that now could potentially break old 
logs... we should open a bug for this.{noformat}
While EventLoggingListener writes to UTF-8 properly when converting to byte[] 
before writing, it doesn't deal with charset in logEvent().

It should be fixed, but as Marcelo said, we also need to be aware of potentially 
breaking the reading of old logs.


> Event log file is written without specific charset which should be ideally 
> UTF-8
> 
>
> Key: SPARK-29160
> URL: https://issues.apache.org/jira/browse/SPARK-29160
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> This issue is from observation by [~vanzin] : 
> [https://github.com/apache/spark/pull/25670#discussion_r325383512]
> Quoting his comment here:
> {quote}
> This is a long standing bug in the original code, but this should be 
> explicitly setting the charset to UTF-8 (using new PrintWriter(new 
> OutputStreamWriter(...)).
> The reader side should too, although doing that now could potentially break 
> old logs... we should open a bug for this.
> {quote}
> While EventLoggingListener writes to UTF-8 properly when converting to byte[] 
> before writing, it doesn't deal with charset in logEvent().
> It should be fixed, but as Marcelo said, we also need to be aware of 
> potentially breaking the reading of old logs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29160) Event log file is written without specific charset which should be ideally UTF-8

2019-09-18 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-29160:


 Summary: Event log file is written without specific charset which 
should be ideally UTF-8
 Key: SPARK-29160
 URL: https://issues.apache.org/jira/browse/SPARK-29160
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


This issue is from observation by [~vanzin] : 
[https://github.com/apache/spark/pull/25670#discussion_r325383512]

Quoting his comment here:
{noformat}
This is a long standing bug in the original code, but this should be explicitly 
setting the charset to UTF-8 (using new PrintWriter(new 
OutputStreamWriter(...)).

The reader side should too, although doing that now could potentially break old 
logs... we should open a bug for this.{noformat}
While EventLoggingListener writes to UTF-8 properly when converting to byte[] 
before writing, it doesn't deal with charset in logEvent().

It should be fixed, but as Marcelo said, we also need to be aware of potentially 
breaking the reading of old logs.
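
A minimal sketch of the writer-side fix described above, i.e. an explicit UTF-8 
charset instead of the platform default; the helper name is illustrative:

{code:java}
import java.io.{OutputStream, OutputStreamWriter, PrintWriter}
import java.nio.charset.StandardCharsets

def newEventLogWriter(out: OutputStream): PrintWriter =
  new PrintWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))
{code}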



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29159) Increase ReservedCodeCacheSize to 1G

2019-09-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29159:
--
Summary: Increase ReservedCodeCacheSize to 1G  (was: Increase CodeCacheSize 
to 1G)

> Increase ReservedCodeCacheSize to 1G
> 
>
> Key: SPARK-29159
> URL: https://issues.apache.org/jira/browse/SPARK-29159
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29042) Sampling-based RDD with unordered input should be INDETERMINATE

2019-09-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932890#comment-16932890
 ] 

Hyukjin Kwon commented on SPARK-29042:
--

Usually I only set "Fix Version/s" as that's what the merge script does.
But I think it can be legitimate to set "Affects Version/s" too.

> Sampling-based RDD with unordered input should be INDETERMINATE
> ---
>
> Key: SPARK-29042
> URL: https://issues.apache.org/jira/browse/SPARK-29042
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.5, 3.0.0
>
>
> We have found and fixed the correctness issue when RDD output is 
> INDETERMINATE. One missing part is sampling-based RDDs. This kind of RDD is 
> order-sensitive to its input: a sampling-based RDD with unordered input 
> should be INDETERMINATE.
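
An illustrative sketch (plain Scala, not Spark internals) of why sampling is 
order-sensitive: with the same seed and the same elements, a different input order 
generally yields a different sample, which is why the output must be marked 
INDETERMINATE when the input order is not fixed:

{code:java}
import scala.util.Random

// Bernoulli-style sampling: the keep/drop decision depends on the element's
// position in the iterator, not on the element itself.
def bernoulliSample[T](input: Iterator[T], fraction: Double, seed: Long): Iterator[T] = {
  val rng = new Random(seed)
  input.filter(_ => rng.nextDouble() < fraction)
}

val seed = 42L
val a = bernoulliSample(Seq(1, 2, 3, 4, 5).iterator, 0.5, seed).toList
val b = bernoulliSample(Seq(5, 4, 3, 2, 1).iterator, 0.5, seed).toList
// Same elements and seed, different order => generally different samples.
println(a)
println(b)
{code}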



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29159) Increase CodeCacheSize to 1G

2019-09-18 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-29159:
-

 Summary: Increase CodeCacheSize to 1G
 Key: SPARK-29159
 URL: https://issues.apache.org/jira/browse/SPARK-29159
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29158) Expose SerializableConfiguration for DSv2

2019-09-18 Thread holdenk (Jira)
holdenk created SPARK-29158:
---

 Summary: Expose SerializableConfiguration for DSv2
 Key: SPARK-29158
 URL: https://issues.apache.org/jira/browse/SPARK-29158
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 2.4.5, 3.0.0
Reporter: holdenk
Assignee: holdenk


Since we use it frequently inside of our own DataSourceV2 implementations (13 
times, from `grep -r broadcastedConf ./sql/core/src/ | grep val | wc -l`), we 
should expose the SerializableConfiguration for DSv2 dev work.
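
For context, a sketch of the usage pattern that grep is counting; the factory class 
below is made up, and SerializableConfiguration is still private[spark] today, which 
is exactly what this issue proposes to change:

{code:java}
import org.apache.hadoop.fs.{FSDataInputStream, Path}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.util.SerializableConfiguration

// A (made-up) partition reader factory that ships the Hadoop conf to executors.
class ExampleReaderFactory(broadcastedConf: Broadcast[SerializableConfiguration])
    extends Serializable {

  def open(pathStr: String): FSDataInputStream = {
    val hadoopConf = broadcastedConf.value.value   // runs on the executor
    val path = new Path(pathStr)
    path.getFileSystem(hadoopConf).open(path)
  }
}
{code}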



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29157) DataSourceV2: Add DataFrameWriterV2 to Python API

2019-09-18 Thread Ryan Blue (Jira)
Ryan Blue created SPARK-29157:
-

 Summary: DataSourceV2: Add DataFrameWriterV2 to Python API
 Key: SPARK-29157
 URL: https://issues.apache.org/jira/browse/SPARK-29157
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue


After SPARK-28612 is committed, we need to add the corresponding PySpark API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27665) Split fetch shuffle blocks protocol from OpenBlocks

2019-09-18 Thread koert kuipers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932833#comment-16932833
 ] 

koert kuipers commented on SPARK-27665:
---

Oh wait, I didn't realize there is a setting, 
spark.shuffle.useOldFetchProtocol. 
Never mind, I will try that.
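
A sketch of trying that setting; the config key is taken from this comment and is 
worth double-checking against the docs of the Spark build in use:

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-with-spark2-shuffle-service")
  .config("spark.shuffle.useOldFetchProtocol", "true")   // talk to a Spark 2.x external shuffle service
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()
{code}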

> Split fetch shuffle blocks protocol from OpenBlocks
> ---
>
> Key: SPARK-27665
> URL: https://issues.apache.org/jira/browse/SPARK-27665
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>
> In the current approach in OneForOneBlockFetcher, we reuse the OpenBlocks 
> protocol to describe the fetch request for shuffle blocks, which makes 
> extension work for shuffle fetching, such as SPARK-9853 and SPARK-25341, very 
> awkward. We need a new protocol dedicated to the shuffle block fetcher.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27665) Split fetch shuffle blocks protocol from OpenBlocks

2019-09-18 Thread koert kuipers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932823#comment-16932823
 ] 

koert kuipers commented on SPARK-27665:
---

I am a little nervous that this got merged into master without resolving the 
blocker SPARK-27780.

Currently this means Spark 3.x will not be able to support dynamic allocation 
at all on YARN clusters that have Spark 2 shuffle managers installed, which is 
pretty much all of our client clusters.

> Split fetch shuffle blocks protocol from OpenBlocks
> ---
>
> Key: SPARK-27665
> URL: https://issues.apache.org/jira/browse/SPARK-27665
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>
> In the current approach in OneForOneBlockFetcher, we reuse the OpenBlocks 
> protocol to describe the fetch request for shuffle blocks, which makes 
> extension work for shuffle fetching, such as SPARK-9853 and SPARK-25341, very 
> awkward. We need a new protocol dedicated to the shuffle block fetcher.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28683) Upgrade Scala to 2.12.10

2019-09-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28683.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25404
[https://github.com/apache/spark/pull/25404]

> Upgrade Scala to 2.12.10
> 
>
> Key: SPARK-28683
> URL: https://issues.apache.org/jira/browse/SPARK-28683
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> *Note that we tested 2.12.9 via https://github.com/apache/spark/pull/25404 
> and found that 2.12.9 has a serious bug, 
> https://github.com/scala/bug/issues/11665 *
> We will skip 2.12.9 and try to upgrade to 2.12.10 directly in this PR.
> h3. Highlights (2.12.9)
>  * Faster compiler: [5–10% faster since 
> 2.12.8|https://scala-ci.typesafe.com/grafana/dashboard/db/scala-benchmark?orgId=1=1543097847070=1564631199344=2.12.x=All=HotScalacBenchmark.compile=scalabench%40scalabench%40],
>  thanks to many optimizations (mostly by Jason Zaugg and Diego E. 
> Alonso-Blas: kudos!)
>  * Improved compatibility with JDK 11, 12, and 13
>  * Experimental support for build pipelining and outline type checking
> [https://github.com/scala/scala/releases/tag/v2.12.9]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29082) Spark driver cannot start with only delegation tokens

2019-09-18 Thread Marcelo Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-29082:
--

Assignee: Marcelo Vanzin

> Spark driver cannot start with only delegation tokens
> -
>
> Key: SPARK-29082
> URL: https://issues.apache.org/jira/browse/SPARK-29082
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
>
> If you start a Spark application with just delegation tokens, it fails. For 
> example, from an Oozie launch, you see things like this (line numbers may be 
> different):
> {noformat}
> No child hadoop job is executed.
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.oozie.action.hadoop.LauncherAM.runActionMain(LauncherAM.java:410)
> at 
> org.apache.oozie.action.hadoop.LauncherAM.access$300(LauncherAM.java:55)
> at 
> org.apache.oozie.action.hadoop.LauncherAM$2.run(LauncherAM.java:223)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.oozie.action.hadoop.LauncherAM.run(LauncherAM.java:217)
> at 
> org.apache.oozie.action.hadoop.LauncherAM$1.run(LauncherAM.java:153)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.oozie.action.hadoop.LauncherAM.main(LauncherAM.java:141)
> Caused by: org.apache.hadoop.security.KerberosAuthException: failure to 
> login: for principal: hrt_qa javax.security.auth.login.LoginException: Unable 
> to obtain password from user
> at 
> org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1847)
> at 
> org.apache.hadoop.security.UserGroupInformation.getUGIFromTicketCache(UserGroupInformation.java:616)
> at 
> org.apache.spark.deploy.security.HadoopDelegationTokenManager.doLogin(HadoopDelegationTokenManager.scala:276)
> at 
> org.apache.spark.deploy.security.HadoopDelegationTokenManager.obtainDelegationTokens(HadoopDelegationTokenManager.scala:140)
> at 
> org.apache.spark.deploy.yarn.Client.setupSecurityToken(Client.scala:305)
> at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:1057)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:179)
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1178)
> at 
> org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1584)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:860)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29082) Spark driver cannot start with only delegation tokens

2019-09-18 Thread Marcelo Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-29082.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25805
[https://github.com/apache/spark/pull/25805]

> Spark driver cannot start with only delegation tokens
> -
>
> Key: SPARK-29082
> URL: https://issues.apache.org/jira/browse/SPARK-29082
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 3.0.0
>
>
> If you start a Spark application with just delegation tokens, it fails. For 
> example, from an Oozie launch, you see things like this (line numbers may be 
> different):
> {noformat}
> No child hadoop job is executed.
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.oozie.action.hadoop.LauncherAM.runActionMain(LauncherAM.java:410)
> at 
> org.apache.oozie.action.hadoop.LauncherAM.access$300(LauncherAM.java:55)
> at 
> org.apache.oozie.action.hadoop.LauncherAM$2.run(LauncherAM.java:223)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.oozie.action.hadoop.LauncherAM.run(LauncherAM.java:217)
> at 
> org.apache.oozie.action.hadoop.LauncherAM$1.run(LauncherAM.java:153)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.oozie.action.hadoop.LauncherAM.main(LauncherAM.java:141)
> Caused by: org.apache.hadoop.security.KerberosAuthException: failure to 
> login: for principal: hrt_qa javax.security.auth.login.LoginException: Unable 
> to obtain password from user
> at 
> org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1847)
> at 
> org.apache.hadoop.security.UserGroupInformation.getUGIFromTicketCache(UserGroupInformation.java:616)
> at 
> org.apache.spark.deploy.security.HadoopDelegationTokenManager.doLogin(HadoopDelegationTokenManager.scala:276)
> at 
> org.apache.spark.deploy.security.HadoopDelegationTokenManager.obtainDelegationTokens(HadoopDelegationTokenManager.scala:140)
> at 
> org.apache.spark.deploy.yarn.Client.setupSecurityToken(Client.scala:305)
> at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:1057)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:179)
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1178)
> at 
> org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1584)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:860)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28683) Upgrade Scala to 2.12.10

2019-09-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28683:
-

Assignee: Yuming Wang

> Upgrade Scala to 2.12.10
> 
>
> Key: SPARK-28683
> URL: https://issues.apache.org/jira/browse/SPARK-28683
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> *Note that we tested 2.12.9 via https://github.com/apache/spark/pull/25404 
> and found that 2.12.9 has a serious bug, 
> https://github.com/scala/bug/issues/11665 *
> We will skip 2.12.9 and try to upgrade to 2.12.10 directly in this PR.
> h3. Highlights (2.12.9)
>  * Faster compiler: [5–10% faster since 
> 2.12.8|https://scala-ci.typesafe.com/grafana/dashboard/db/scala-benchmark?orgId=1=1543097847070=1564631199344=2.12.x=All=HotScalacBenchmark.compile=scalabench%40scalabench%40],
>  thanks to many optimizations (mostly by Jason Zaugg and Diego E. 
> Alonso-Blas: kudos!)
>  * Improved compatibility with JDK 11, 12, and 13
>  * Experimental support for build pipelining and outline type checking
> [https://github.com/scala/scala/releases/tag/v2.12.9]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29042) Sampling-based RDD with unordered input should be INDETERMINATE

2019-09-18 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932784#comment-16932784
 ] 

Liang-Chi Hsieh commented on SPARK-29042:
-

[~hyukjin.kwon] Am I setting the fix versions and affects versions correctly 
after the backport? Can you take a look? Thanks.

 

> Sampling-based RDD with unordered input should be INDETERMINATE
> ---
>
> Key: SPARK-29042
> URL: https://issues.apache.org/jira/browse/SPARK-29042
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.5, 3.0.0
>
>
> We have found and fixed the correctness issue when RDD output is 
> INDETERMINATE. One missing part is sampling-based RDDs. This kind of RDD is 
> order-sensitive to its input: a sampling-based RDD with unordered input 
> should be INDETERMINATE.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29042) Sampling-based RDD with unordered input should be INDETERMINATE

2019-09-18 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-29042:

Fix Version/s: 2.4.5

> Sampling-based RDD with unordered input should be INDETERMINATE
> ---
>
> Key: SPARK-29042
> URL: https://issues.apache.org/jira/browse/SPARK-29042
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>  Labels: correctness
> Fix For: 2.4.5, 3.0.0
>
>
> We have found and fixed the correctness issue when RDD output is 
> INDETERMINATE. One missing part is sampling-based RDDs. This kind of RDD is 
> order-sensitive to its input: a sampling-based RDD with unordered input 
> should be INDETERMINATE.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29156) Hive appends data as part of CDC; in write mode we should be able to write only the captured changes to Teradata or another data source.

2019-09-18 Thread raju (Jira)
raju created SPARK-29156:


 Summary: Hive appends data as part of CDC; in write mode we should be 
able to write only the captured changes to Teradata or another data source.
 Key: SPARK-29156
 URL: https://issues.apache.org/jira/browse/SPARK-29156
 Project: Spark
  Issue Type: New Feature
  Components: Tests
Affects Versions: 2.4.3
 Environment: spark 2.3.2

dataiku

aws emr
Reporter: raju


In general, change data capture (CDC) records are appended to Hive tables. We have a 
scenario where we connect to Teradata or another data source, and only the changes 
captured as updates should be written to that data source. We are unable to do this 
with the overwrite and append modes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22796) Add multiple column support to PySpark QuantileDiscretizer

2019-09-18 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh resolved SPARK-22796.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25812
[https://github.com/apache/spark/pull/25812]

> Add multiple column support to PySpark QuantileDiscretizer
> --
>
> Key: SPARK-22796
> URL: https://issues.apache.org/jira/browse/SPARK-22796
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22796) Add multiple column support to PySpark QuantileDiscretizer

2019-09-18 Thread Liang-Chi Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh reassigned SPARK-22796:
---

Assignee: Huaxin Gao

> Add multiple column support to PySpark QuantileDiscretizer
> --
>
> Key: SPARK-22796
> URL: https://issues.apache.org/jira/browse/SPARK-22796
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22390) Aggregate push down

2019-09-18 Thread holdenk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932739#comment-16932739
 ] 

holdenk commented on SPARK-22390:
-

Love to follow where this is going, especially if it gets broken into smaller 
pieces of work.

> Aggregate push down
> ---
>
> Key: SPARK-22390
> URL: https://issues.apache.org/jira/browse/SPARK-22390
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29155) Support special date/timestamp values in the PostgreSQL dialect only

2019-09-18 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29155:
--

 Summary: Support special date/timestamp values in the PostgreSQL 
dialect only
 Key: SPARK-29155
 URL: https://issues.apache.org/jira/browse/SPARK-29155
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


The PR https://github.com/apache/spark/pull/25716 added support for special timestamp 
values for all dialects. We need to hide this feature behind the *config 
spark.sql.dialect* = PostgreSQL. See 
https://github.com/apache/spark/pull/25716#issuecomment-532514518
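
A minimal sketch of the intended gating (the config name is taken from the description
above; the special values shown, such as 'epoch' and 'now', follow PostgreSQL and may
differ in the final patch):

{code}
// Hedged sketch: special values are only recognized under the PostgreSQL dialect.
spark.conf.set("spark.sql.dialect", "PostgreSQL")
spark.sql("SELECT timestamp 'epoch', timestamp 'now'").show()
// Under the default dialect, the same literals should not be treated as special values.
{code}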



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28091) Extend Spark metrics system with user-defined metrics using executor plugins

2019-09-18 Thread Marcelo Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-28091:
--

Assignee: Luca Canali

> Extend Spark metrics system with user-defined metrics using executor plugins
> 
>
> Key: SPARK-28091
> URL: https://issues.apache.org/jira/browse/SPARK-28091
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
>
> This proposes to improve Spark instrumentation by adding a hook for 
> user-defined metrics, extending Spark’s Dropwizard/Codahale metrics system.
> The original motivation of this work was to add instrumentation for S3 
> filesystem access metrics in Spark jobs. Currently, [[ExecutorSource]] 
> instruments HDFS and local filesystem metrics. Rather than extending the code 
> there, we propose with this JIRA to add a metrics plugin system that is more 
> flexible and of more general use.
> Context: The Spark metrics system provides a large variety of metrics useful 
> to monitor and troubleshoot Spark workloads. A typical workflow is to sink the 
> metrics to a storage system and build dashboards on top of that.
> Highlights:
>  * The metric plugin system makes it easy to implement instrumentation for S3 
> access by Spark jobs.
>  * The metrics plugin system allows for easy extensions of how Spark collects 
> HDFS-related workload metrics. This is currently done using the Hadoop 
> Filesystem GetAllStatistics method, which is deprecated in recent versions of 
> Hadoop. Recent versions of Hadoop Filesystem recommend using method 
> GetGlobalStorageStatistics, which also provides several additional metrics. 
> GetGlobalStorageStatistics is not available in Hadoop 2.7 (had been 
> introduced in Hadoop 2.8). Using a metric plugin for Spark would allow an 
> easy way to “opt in” using such new API calls for those deploying suitable 
> Hadoop versions.
>  * We also have the use case of adding Hadoop filesystem monitoring for a 
> custom Hadoop compliant filesystem in use in our organization (EOS using the 
> XRootD protocol). The metrics plugin infrastructure makes this easy to do. 
> Others may have similar use cases.
>  * More generally, this method makes it straightforward to plug Filesystem 
> and other metrics into the Spark monitoring system. Future work on the plugin 
> implementation can address extending monitoring to measure usage of external 
> resources (OS, filesystem, network, accelerator cards, etc.) that might not 
> normally be considered general enough for inclusion in Apache Spark code, but 
> that can nevertheless be useful for specialized use cases, tests or 
> troubleshooting.
> Implementation:
> The proposed implementation builds on top of the work on Executor Plugin of 
> SPARK-24918 and builds on recent work on extending Spark executor metrics, 
> such as SPARK-25228
> Tests and examples:
> This has been so far manually tested running Spark on YARN and K8S clusters, 
> in particular for monitoring S3 and for extending HDFS instrumentation with 
> the Hadoop Filesystem “GetGlobalStorageStatistics” metrics. Executor metric 
> plugin example and code used for testing are available.
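
A minimal sketch of the kind of user-defined metric such a plugin could expose, assuming
the SPARK-24918 ExecutorPlugin hook (Spark 2.4 API; the interface added by this ticket
may differ) and a plain Dropwizard registry. The plugin class and metric name are
made-up examples:

{code}
import java.util.concurrent.TimeUnit
import com.codahale.metrics.{ConsoleReporter, MetricRegistry}
import org.apache.spark.ExecutorPlugin

// Hypothetical example plugin; only the ExecutorPlugin hook and Dropwizard types are real.
class S3ReadCounterPlugin extends ExecutorPlugin {
  private val registry = new MetricRegistry()
  val s3Reads = registry.counter("myapp.s3.reads")   // incremented by user code
  private var reporter: ConsoleReporter = _

  override def init(): Unit = {
    // A real implementation would register with Spark's metrics system instead.
    reporter = ConsoleReporter.forRegistry(registry).build()
    reporter.start(30, TimeUnit.SECONDS)
  }

  override def shutdown(): Unit = {
    if (reporter != null) reporter.stop()
  }
}
// Enabled on executors with: --conf spark.executor.plugins=S3ReadCounterPlugin
{code}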



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28091) Extend Spark metrics system with user-defined metrics using executor plugins

2019-09-18 Thread Marcelo Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-28091.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 24901
[https://github.com/apache/spark/pull/24901]

> Extend Spark metrics system with user-defined metrics using executor plugins
> 
>
> Key: SPARK-28091
> URL: https://issues.apache.org/jira/browse/SPARK-28091
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 3.0.0
>
>
> This proposes to improve Spark instrumentation by adding a hook for 
> user-defined metrics, extending Spark’s Dropwizard/Codahale metrics system.
> The original motivation of this work was to add instrumentation for S3 
> filesystem access metrics in Spark jobs. Currently, [[ExecutorSource]] 
> instruments HDFS and local filesystem metrics. Rather than extending the code 
> there, we propose with this JIRA to add a metrics plugin system that is more 
> flexible and of more general use.
> Context: The Spark metrics system provides a large variety of metrics useful 
> to monitor and troubleshoot Spark workloads. A typical workflow is to sink the 
> metrics to a storage system and build dashboards on top of that.
> Highlights:
>  * The metric plugin system makes it easy to implement instrumentation for S3 
> access by Spark jobs.
>  * The metrics plugin system allows for easy extensions of how Spark collects 
> HDFS-related workload metrics. This is currently done using the Hadoop 
> Filesystem GetAllStatistics method, which is deprecated in recent versions of 
> Hadoop. Recent versions of Hadoop Filesystem recommend using method 
> GetGlobalStorageStatistics, which also provides several additional metrics. 
> GetGlobalStorageStatistics is not available in Hadoop 2.7 (had been 
> introduced in Hadoop 2.8). Using a metric plugin for Spark would allow an 
> easy way to “opt in” using such new API calls for those deploying suitable 
> Hadoop versions.
>  * We also have the use case of adding Hadoop filesystem monitoring for a 
> custom Hadoop compliant filesystem in use in our organization (EOS using the 
> XRootD protocol). The metrics plugin infrastructure makes this easy to do. 
> Others may have similar use cases.
>  * More generally, this method makes it straightforward to plug Filesystem 
> and other metrics into the Spark monitoring system. Future work on the plugin 
> implementation can address extending monitoring to measure usage of external 
> resources (OS, filesystem, network, accelerator cards, etc.) that might not 
> normally be considered general enough for inclusion in Apache Spark code, but 
> that can nevertheless be useful for specialized use cases, tests or 
> troubleshooting.
> Implementation:
> The proposed implementation builds on top of the work on Executor Plugin of 
> SPARK-24918 and builds on recent work on extending Spark executor metrics, 
> such as SPARK-25228
> Tests and examples:
> This has been so far manually tested running Spark on YARN and K8S clusters, 
> in particular for monitoring S3 and for extending HDFS instrumentation with 
> the Hadoop Filesystem “GetGlobalStorageStatistics” metrics. Executor metric 
> plugin example and code used for testing are available.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count

2019-09-18 Thread Suchintak Patnaik (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suchintak Patnaik reopened SPARK-29058:
---

Though the workaround of caching the dataframe first and then using count() 
works well, it is not feasible if the base dataset is large.

The dataframe count should give the correct count after discarding the corrupt 
records.
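
For reference, a minimal sketch of that workaround (file name and schema taken from
the description below; a Scala shell with `spark` in scope is assumed):

{code}
// Caching materializes every column, so the malformed-row check actually runs
// before count() is answered; without it, column pruning lets count() skip parsing.
val schema = "Fruit string, color string, price int, quantity int"
val df = spark.read
  .schema(schema)
  .option("mode", "DROPMALFORMED")
  .csv("fruit.csv")
  .cache()
df.count()   // 2 with the sample file below, matching df.show()
{code}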

> Reading csv file with DROPMALFORMED showing incorrect record count
> --
>
> Key: SPARK-29058
> URL: https://issues.apache.org/jira/browse/SPARK-29058
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Minor
>
> The Spark SQL CSV reader drops malformed records as expected, but the 
> record count it reports is incorrect.
> Consider this file (fruit.csv)
> {code}
> apple,red,1,3
> banana,yellow,2,4.56
> orange,orange,3,5
> {code}
> Defining schema as follows:
> {code}
> schema = "Fruit string,color string,price int,quantity int"
> {code}
> Notice that the "quantity" field is defined as integer type, but the 2nd row 
> in the file contains a floating point value, hence it is a corrupt record.
> {code}
> >>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
> >>> df.show()
> +--+--+-++
> | Fruit| color|price|quantity|
> +--+--+-++
> | apple|   red|1|   3|
> |orange|orange|3|   5|
> +--+--+-++
> >>> df.count()
> 3
> {code}
> The malformed record is dropped as expected, but an incorrect record count 
> is displayed.
> Here df.count() should return 2.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2019-09-18 Thread Jacob Niebloom (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932666#comment-16932666
 ] 

Jacob Niebloom commented on SPARK-25075:


I am a potential new contributor to Spark. Is there a way I can help with this?

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29141) Use SqlBasedBenchmark in SQL benchmarks

2019-09-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29141:
-

Assignee: Maxim Gekk

> Use SqlBasedBenchmark in SQL benchmarks
> ---
>
> Key: SPARK-29141
> URL: https://issues.apache.org/jira/browse/SPARK-29141
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Trivial
>
> This ticket was created in response to [~dongjoon]'s comment: 
> https://github.com/apache/spark/pull/25772#discussion_r323891916 . The purpose 
> is to have all SQL-related benchmarks extend the single trait SqlBasedBenchmark. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark

2019-09-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932646#comment-16932646
 ] 

Dongjoon Hyun commented on SPARK-29106:
---

I changed the affected version to 3.0.0 because this is a new testing feature and 
it seems that only the `master` branch is being tested.
- http://status.openlabtesting.org/builds?project=apache/spark

> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Minor
>
> Add arm test jobs to amplab jenkins. OpenLab will offer arm instances to 
> amplab to support arm test for spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29141) Use SqlBasedBenchmark in SQL benchmarks

2019-09-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29141:
--
Priority: Minor  (was: Trivial)

> Use SqlBasedBenchmark in SQL benchmarks
> ---
>
> Key: SPARK-29141
> URL: https://issues.apache.org/jira/browse/SPARK-29141
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> This ticket was created in response to [~dongjoon]'s comment: 
> https://github.com/apache/spark/pull/25772#discussion_r323891916 . The purpose 
> is to have all SQL-related benchmarks extend the single trait SqlBasedBenchmark. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29106) Add jenkins arm test for spark

2019-09-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29106:
--
Affects Version/s: (was: 2.4.4)
   3.0.0

> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Major
>
> Add arm test jobs to amplab jenkins. OpenLab will offer arm instances to 
> amplab to support arm test for spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29106) Add jenkins arm test for spark

2019-09-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29106:
--
Priority: Minor  (was: Major)

> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Minor
>
> Add arm test jobs to amplab jenkins. OpenLab will offer arm instances to 
> amplab to support arm test for spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28208) When upgrading to ORC 1.5.6, the reader needs to be closed.

2019-09-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28208:
-

Assignee: Owen O'Malley

> When upgrading to ORC 1.5.6, the reader needs to be closed.
> ---
>
> Key: SPARK-28208
> URL: https://issues.apache.org/jira/browse/SPARK-28208
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Major
>
> As part of the ORC 1.5.6 release, we optimized the common pattern of:
> {code:java}
> Reader reader = OrcFile.createReader(...);
> RecordReader rows = reader.rows(...);{code}
> which used to open one file handle in the Reader and a second one in the 
> RecordReader. Users were seeing this as a regression when moving from the old 
> Spark ORC reader via Hive to the new native reader, because it opened twice 
> as many files on the NameNode.
> In ORC 1.5.6, we changed the ORC library so that it keeps the file handle in 
> the Reader until it is either closed or a RecordReader is created from it. 
> This has cut down the number of file open requests on the NameNode by half in 
> typical Spark applications. (Hive's ORC code avoided this problem by putting 
> the file footer into the input splits, but that has other problems.)
> To get the new optimization without leaking file handles, Spark needs to 
> close the readers that aren't used to create RecordReaders.
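
A minimal illustration of the required pattern (not the actual Spark change; the path
and configuration are placeholders):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

val conf = new Configuration()
val reader = OrcFile.createReader(new Path("/tmp/part-00000.orc"), OrcFile.readerOptions(conf))
try {
  // Metadata-only access: no RecordReader is created, so the Reader still owns the handle.
  println(reader.getSchema.toString)
} finally {
  reader.close()   // releases the file handle that ORC 1.5.6+ keeps in the Reader
}
{code}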



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28208) When upgrading to ORC 1.5.6, the reader needs to be closed.

2019-09-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28208.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25006
[https://github.com/apache/spark/pull/25006]

> When upgrading to ORC 1.5.6, the reader needs to be closed.
> ---
>
> Key: SPARK-28208
> URL: https://issues.apache.org/jira/browse/SPARK-28208
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Owen O'Malley
>Assignee: Owen O'Malley
>Priority: Major
> Fix For: 3.0.0
>
>
> As part of the ORC 1.5.6 release, we optimized the common pattern of:
> {code:java}
> Reader reader = OrcFile.createReader(...);
> RecordReader rows = reader.rows(...);{code}
> which used to open one file handle in the Reader and a second one in the 
> RecordReader. Users were seeing this as a regression when moving from the old 
> Spark ORC reader via Hive to the new native reader, because it opened twice 
> as many files on the NameNode.
> In ORC 1.5.6, we changed the ORC library so that it keeps the file handle in 
> the Reader until it is either closed or a RecordReader is created from it. 
> This has cut down the number of file open requests on the NameNode by half in 
> typical Spark applications. (Hive's ORC code avoided this problem by putting 
> the file footer into the input splits, but that has other problems.)
> To get the new optimization without leaking file handles, Spark needs to 
> close the readers that aren't used to create RecordReaders.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29030) Simplify lookupV2Relation

2019-09-18 Thread Burak Yavuz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-29030.
-
Fix Version/s: 3.0.0
 Assignee: John Zhuge
   Resolution: Done

Resolved by [https://github.com/apache/spark/pull/25735]

> Simplify lookupV2Relation
> -
>
> Key: SPARK-29030
> URL: https://issues.apache.org/jira/browse/SPARK-29030
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Assignee: John Zhuge
>Priority: Minor
> Fix For: 3.0.0
>
>
> Simplify the return type for {{lookupV2Relation}} which makes the 3 callers 
> more straightforward.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16452) basic INFORMATION_SCHEMA support

2019-09-18 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-16452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-16452:

Target Version/s:   (was: 3.0.0)

> basic INFORMATION_SCHEMA support
> 
>
> Key: SPARK-16452
> URL: https://issues.apache.org/jira/browse/SPARK-16452
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Major
> Attachments: INFORMATION_SCHEMAsupport.pdf
>
>
> INFORMATION_SCHEMA is part of SQL92 support. This ticket proposes adding a 
> few tables as defined in SQL92 standard to Spark SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26022) PySpark Comparison with Pandas

2019-09-18 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-26022.
-
Target Version/s:   (was: 3.0.0)
  Resolution: Later

> PySpark Comparison with Pandas
> --
>
> Key: SPARK-26022
> URL: https://issues.apache.org/jira/browse/SPARK-26022
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Hyukjin Kwon
>Priority: Major
>
> It would be very nice if we can have a doc like 
> https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html to show 
> the API difference between PySpark and Pandas. 
> Reference:
> https://www.kdnuggets.com/2016/01/python-data-science-pandas-spark-dataframe-differences.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26022) PySpark Comparison with Pandas

2019-09-18 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932610#comment-16932610
 ] 

Xiao Li commented on SPARK-26022:
-

[https://github.com/databricks/koalas] is meant to close that gap. 

> PySpark Comparison with Pandas
> --
>
> Key: SPARK-26022
> URL: https://issues.apache.org/jira/browse/SPARK-26022
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Hyukjin Kwon
>Priority: Major
>
> It would be very nice if we can have a doc like 
> https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html to show 
> the API difference between PySpark and Pandas. 
> Reference:
> https://www.kdnuggets.com/2016/01/python-data-science-pandas-spark-dataframe-differences.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29105) SHS may delete driver log file of in progress application

2019-09-18 Thread Marcelo Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-29105:
--

Assignee: Marcelo Vanzin

> SHS may delete driver log file of in progress application
> -
>
> Key: SPARK-29105
> URL: https://issues.apache.org/jira/browse/SPARK-29105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
>
> There's an issue with how the SHS cleans driver logs that is similar to the 
> problem of event logs: because the file size is not updated when you write to 
> it, the SHS fails to detect activity and thus may delete the file while it's 
> still being written to.
> SPARK-24787 added a workaround in the SHS so that it can detect that 
> situation for in-progress apps, replacing the previous solution which was too 
> slow for event logs.
> But that doesn't work for driver logs because they do not follow the same 
> pattern (different file names for in-progress files), and thus would require 
> the SHS to open the driver log files on every scan, which is expensive.
> The old approach (using the {{hsync}} API) seems to be a good match for the 
> driver logs, though, which don't slow down the listener bus like event logs 
> do.
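
A minimal sketch of the hsync-based approach mentioned above, assuming an HDFS-backed
driver-log directory (the path is a placeholder):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val out = fs.create(new Path("/spark-driver-logs/app-1234.log"))   // placeholder path
out.writeBytes("driver log line\n")
out.hsync()   // flush the bytes to the datanodes so an external scanner can observe progress
{code}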



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29105) SHS may delete driver log file of in progress application

2019-09-18 Thread Marcelo Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-29105.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25819
[https://github.com/apache/spark/pull/25819]

> SHS may delete driver log file of in progress application
> -
>
> Key: SPARK-29105
> URL: https://issues.apache.org/jira/browse/SPARK-29105
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 3.0.0
>
>
> There's an issue with how the SHS cleans driver logs that is similar to the 
> problem of event logs: because the file size is not updated when you write to 
> it, the SHS fails to detect activity and thus may delete the file while it's 
> still being written to.
> SPARK-24787 added a workaround in the SHS so that it can detect that 
> situation for in-progress apps, replacing the previous solution which was too 
> slow for event logs.
> But that doesn't work for driver logs because they do not follow the same 
> pattern (different file names for in-progress files), and thus would require 
> the SHS to open the driver log files on every scan, which is expensive.
> The old approach (using the {{hsync}} API) seems to be a good match for the 
> driver logs, though, which don't slow down the listener bus like event logs 
> do.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29104) Fix Flaky Test - PipedRDDSuite. stdin_writer_thread_should_be_exited_when_task_is_finished

2019-09-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29104:
--
Fix Version/s: 2.4.5

> Fix Flaky Test - PipedRDDSuite. 
> stdin_writer_thread_should_be_exited_when_task_is_finished
> --
>
> Key: SPARK-29104
> URL: https://issues.apache.org/jira/browse/SPARK-29104
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6867/testReport/junit/org.apache.spark.rdd/PipedRDDSuite/stdin_writer_thread_should_be_exited_when_task_is_finished/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26713) PipedRDD may holds stdin writer and stdout read threads even if the task is finished

2019-09-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26713:
--
Fix Version/s: 2.4.5

> PipedRDD may holds stdin writer and stdout read threads even if the task is 
> finished
> 
>
> Key: SPARK-26713
> URL: https://issues.apache.org/jira/browse/SPARK-26713
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.3, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Xianjin YE
>Assignee: Xianjin YE
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> During an investigation of an OOM in one internal production job, I found that 
> PipedRDD leaks memory. After some digging, the problem comes down to the fact 
> that PipedRDD doesn't release the stdin writer and stdout reader threads even 
> if the task is finished.
>  
> PipedRDD creates two threads: a stdin writer and a stdout reader. If we are 
> lucky and the task finishes normally, these two threads exit normally. If the 
> subprocess (the pipe command) fails, the task will be marked failed, but the 
> stdin writer will still be running until it consumes its parent RDD's 
> iterator. There is even a race condition with ShuffledRDD + PipedRDD: the 
> ShuffleBlockFetchIterator is cleaned up at task completion and hangs the stdin 
> writer thread, which leaks memory. 
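
An illustrative pattern for the leak described above (command and sizes are arbitrary
examples):

{code}
// The piped command exits non-zero without reading its input, so the task fails,
// but (before this fix) the stdin-writer thread kept draining the parent iterator.
val rdd = sc.parallelize(1 to 10000000, 2).pipe("false")
rdd.count()   // fails with a "Subprocess exited with status ..." error; the writer thread lingers
{code}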



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27495) SPIP: Support Stage level resource configuration and scheduling

2019-09-18 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-27495:
--
Epic Name: Stage Level Scheduling

> SPIP: Support Stage level resource configuration and scheduling
> ---
>
> Key: SPARK-27495
> URL: https://issues.apache.org/jira/browse/SPARK-27495
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
>  Labels: SPIP
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
> Objectives:
>  # Allow users to specify task and executor resource requirements at the 
> stage level. 
>  # Spark will use the stage level requirements to acquire the necessary 
> resources/executors and schedule tasks based on the per stage requirements.
> Many times users have different resource requirements for different stages of 
> their application so they want to be able to configure resources at the stage 
> level. For instance, you have a single job that has 2 stages. The first stage 
> does some  ETL which requires a lot of tasks, each with a small amount of 
> memory and 1 core each. Then you have a second stage where you feed that ETL 
> data into an ML algorithm. The second stage only requires a few executors but 
> each executor needs a lot of memory, GPUs, and many cores.  This feature 
> allows the user to specify the task and executor resource requirements for 
> the ETL Stage and then change them for the ML stage of the job. 
> Resources include cpu, memory (on heap, overhead, pyspark, and off heap), and 
> extra Resources (GPU/FPGA/etc). It has the potential to allow for other 
> things like limiting the number of tasks per stage, specifying other 
> parameters for things like shuffle, etc. Initially I would propose we only 
> support resources as they are now. So Task resources would be cpu and other 
> resources (GPU, FPGA), that way we aren't adding in extra scheduling things 
> at this point.  Executor resources would be cpu, memory, and extra 
> resources(GPU,FPGA, etc). Changing the executor resources will rely on 
> dynamic allocation being enabled.
> Main use cases:
>  # ML use case where user does ETL and feeds it into an ML algorithm where 
> it’s using the RDD API. This should work with barrier scheduling as well once 
> it supports dynamic allocation.
>  # This adds the framework/api for Spark's own internal use.  In the future 
> (not covered by this SPIP), Catalyst could control the stage level resources 
> as it finds the need to change it between stages for different optimizations. 
> For instance, with the new columnar plugin to the query planner we can insert 
> stages into the plan that would change running something on the CPU in row 
> format to running it on the GPU in columnar format. This API would allow the 
> planner to make sure the stages that run on the GPU get the corresponding GPU 
> resources it needs to run. Another possible use case for catalyst is that it 
> would allow catalyst to add in more optimizations to where the user doesn’t 
> need to configure container sizes at all. If the optimizer/planner can handle 
> that for the user, everyone wins.
> This SPIP focuses on the RDD API but we don’t exclude the Dataset API. I 
> think the DataSet API will require more changes because it specifically hides 
> the RDD from the users via the plans and catalyst can optimize the plan and 
> insert things into the plan. The only way I’ve found to make this work with 
> the Dataset API would be modifying all the plans to be able to get the 
> resource requirements down into where it creates the RDDs, which I believe 
> would be a lot of change.  If other people know better options, it would be 
> great to hear them.
> *Q2.* What problem is this proposal NOT designed to solve?
> The initial implementation is not going to add Dataset APIs.
> We are starting with allowing users to specify a specific set of 
> task/executor resources and plan to design it to be extendable, but the first 
> implementation will not support changing generic SparkConf configs and only 
> specific limited resources.
> This initial version will have a programmatic API for specifying the resource 
> requirements per stage, we can add the ability to perhaps have profiles in 
> the configs later if its useful.
> *Q3.* How is it done today, and what are the limits of current practice?
> Currently this is either done by having multiple spark jobs or requesting 
> containers with the max resources needed for any part of the job.  To do this 
> today, you can break it into separate jobs where each job requests the 
> corresponding resources needed, but then you have to write the 

[jira] [Updated] (SPARK-27495) SPIP: Support Stage level resource configuration and scheduling

2019-09-18 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-27495:
--
Labels: SPIP  (was: )

> SPIP: Support Stage level resource configuration and scheduling
> ---
>
> Key: SPARK-27495
> URL: https://issues.apache.org/jira/browse/SPARK-27495
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
>  Labels: SPIP
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
> Objectives:
>  # Allow users to specify task and executor resource requirements at the 
> stage level. 
>  # Spark will use the stage level requirements to acquire the necessary 
> resources/executors and schedule tasks based on the per stage requirements.
> Many times users have different resource requirements for different stages of 
> their application so they want to be able to configure resources at the stage 
> level. For instance, you have a single job that has 2 stages. The first stage 
> does some  ETL which requires a lot of tasks, each with a small amount of 
> memory and 1 core each. Then you have a second stage where you feed that ETL 
> data into an ML algorithm. The second stage only requires a few executors but 
> each executor needs a lot of memory, GPUs, and many cores.  This feature 
> allows the user to specify the task and executor resource requirements for 
> the ETL Stage and then change them for the ML stage of the job. 
> Resources include cpu, memory (on heap, overhead, pyspark, and off heap), and 
> extra Resources (GPU/FPGA/etc). It has the potential to allow for other 
> things like limiting the number of tasks per stage, specifying other 
> parameters for things like shuffle, etc. Initially I would propose we only 
> support resources as they are now. So Task resources would be cpu and other 
> resources (GPU, FPGA), that way we aren't adding in extra scheduling things 
> at this point.  Executor resources would be cpu, memory, and extra 
> resources(GPU,FPGA, etc). Changing the executor resources will rely on 
> dynamic allocation being enabled.
> Main use cases:
>  # ML use case where user does ETL and feeds it into an ML algorithm where 
> it’s using the RDD API. This should work with barrier scheduling as well once 
> it supports dynamic allocation.
>  # This adds the framework/api for Spark's own internal use.  In the future 
> (not covered by this SPIP), Catalyst could control the stage level resources 
> as it finds the need to change it between stages for different optimizations. 
> For instance, with the new columnar plugin to the query planner we can insert 
> stages into the plan that would change running something on the CPU in row 
> format to running it on the GPU in columnar format. This API would allow the 
> planner to make sure the stages that run on the GPU get the corresponding GPU 
> resources it needs to run. Another possible use case for catalyst is that it 
> would allow catalyst to add in more optimizations to where the user doesn’t 
> need to configure container sizes at all. If the optimizer/planner can handle 
> that for the user, everyone wins.
> This SPIP focuses on the RDD API but we don’t exclude the Dataset API. I 
> think the DataSet API will require more changes because it specifically hides 
> the RDD from the users via the plans and catalyst can optimize the plan and 
> insert things into the plan. The only way I’ve found to make this work with 
> the Dataset API would be modifying all the plans to be able to get the 
> resource requirements down into where it creates the RDDs, which I believe 
> would be a lot of change.  If other people know better options, it would be 
> great to hear them.
> *Q2.* What problem is this proposal NOT designed to solve?
> The initial implementation is not going to add Dataset APIs.
> We are starting with allowing users to specify a specific set of 
> task/executor resources and plan to design it to be extendable, but the first 
> implementation will not support changing generic SparkConf configs and only 
> specific limited resources.
> This initial version will have a programmatic API for specifying the resource 
> requirements per stage, we can add the ability to perhaps have profiles in 
> the configs later if its useful.
> *Q3.* How is it done today, and what are the limits of current practice?
> Currently this is either done by having multiple spark jobs or requesting 
> containers with the max resources needed for any part of the job.  To do this 
> today, you can break it into separate jobs where each job requests the 
> corresponding resources needed, but then you have to write the data out 
> 

[jira] [Created] (SPARK-29154) Update Spark scheduler for stage level scheduling

2019-09-18 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-29154:
-

 Summary: Update Spark scheduler for stage level scheduling
 Key: SPARK-29154
 URL: https://issues.apache.org/jira/browse/SPARK-29154
 Project: Spark
  Issue Type: Story
  Components: Scheduler
Affects Versions: 3.0.0
Reporter: Thomas Graves


Make the changes to the DAGScheduler, Stage, TaskSetManager, and TaskScheduler to 
support scheduling based on the resource profiles. Note that the logic to 
merge profiles has a separate JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29150) Update RDD API for Stage level scheduling

2019-09-18 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-29150:
--
Description: 
See the SPIP and design doc attached to SPARK-27495.

 Note this is meant to be the final task of updating the actual public RDD API; 
we want all the other changes in place before enabling this.

  was:
See the SPIP and design doc attached to SPARK-27495.

 


> Update RDD API for Stage level scheduling
> -
>
> Key: SPARK-29150
> URL: https://issues.apache.org/jira/browse/SPARK-29150
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> See the SPIP and design doc attached to SPARK-27495.
>  Note this is meant to be the final task of updating the actual public RDD 
> API; we want all the other changes in place before enabling this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29153) ResourceProfile conflict resolution stage level scheduling

2019-09-18 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-29153:
-

 Summary: ResourceProfile conflict resolution stage level scheduling
 Key: SPARK-29153
 URL: https://issues.apache.org/jira/browse/SPARK-29153
 Project: Spark
  Issue Type: Story
  Components: Scheduler
Affects Versions: 3.0.0
Reporter: Thomas Graves


For stage level scheduling, if a stage has ResourceProfiles from multiple 
RDDs that conflict, we have to resolve that conflict.

We may have 2 approaches:
 # Default to erroring out on a conflict, so the user realizes what is going 
on; have a config to turn this on and off.
 # If the error-out config is off, then resolve the conflict. See below, from 
the design doc on the SPIP.

For the merge strategy we can choose the max from the ResourceProfiles to make 
the largest container required. This in general will work, but there are a few 
cases where people may have intended them to be a sum. For instance, let's say one RDD 
needs X memory and another RDD needs Y memory. It might be that when those get 
combined into a stage you really need X+Y memory rather than max(X, Y). Another 
example might be union, where you would want to sum the resources of each RDD. 
I think we can document what we choose for now and later on add the ability 
to have other alternatives than max. Or perhaps we do need to change what we 
do either per operation or per resource type. 
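
A small illustrative sketch of the max-based merge (ResourceProfile here is just a
placeholder case class, not the eventual Spark API):

{code}
// Hypothetical shape: resource name -> amount (e.g. "memory" in MiB, "gpu" as a count).
case class ResourceProfile(resources: Map[String, Long])

def mergeByMax(profiles: Seq[ResourceProfile]): ResourceProfile =
  ResourceProfile(
    profiles.flatMap(_.resources).groupBy(_._1).map { case (name, kvs) =>
      name -> kvs.map(_._2).max   // a union-like stage might instead want .sum here
    }
  )

// mergeByMax(Seq(ResourceProfile(Map("memory" -> 4096, "gpu" -> 1)),
//                ResourceProfile(Map("memory" -> 8192))))
// => ResourceProfile(Map("memory" -> 8192, "gpu" -> 1))
{code}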



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29118) Avoid redundant computation in GMM.transform && GLR.transform

2019-09-18 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-29118.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25815
[https://github.com/apache/spark/pull/25815]

> Avoid redundant computation in GMM.transform && GLR.transform
> -
>
> Key: SPARK-29118
> URL: https://issues.apache.org/jira/browse/SPARK-29118
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> In SPARK-27944, the computation for output columns with an empty name is skipped.
> Now, I find that we can further optimize:
> 1. GMM.transform, by directly obtaining the prediction (double) from its 
> probability prediction (vector), like ProbabilisticClassificationModel and 
> ClassificationModel do.
> 2. GLR.transform, by obtaining the prediction (double) from its link 
> prediction (double).
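
For the GMM case, a one-line sketch of the idea (illustrative, not the actual patch):

{code}
import org.apache.spark.ml.linalg.Vector

// The hard cluster assignment is just the argmax of the already-computed
// probability vector, so it need not be recomputed from the raw features.
def probability2prediction(probability: Vector): Double = probability.argmax.toDouble
{code}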



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29118) Avoid redundant computation in GMM.transform && GLR.transform

2019-09-18 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-29118:
-

Assignee: zhengruifeng

> Avoid redundant computation in GMM.transform && GLR.transform
> -
>
> Key: SPARK-29118
> URL: https://issues.apache.org/jira/browse/SPARK-29118
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> In SPARK-27944, the computation for output columns with an empty name is skipped.
> Now, I find that we can further optimize:
> 1. GMM.transform, by directly obtaining the prediction (double) from its 
> probability prediction (vector), like ProbabilisticClassificationModel and 
> ClassificationModel do.
> 2. GLR.transform, by obtaining the prediction (double) from its link 
> prediction (double).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19926) Make pyspark exception more readable

2019-09-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-19926:


Assignee: Xianjin YE  (was: Genmao Yu)

> Make pyspark exception more readable
> 
>
> Key: SPARK-19926
> URL: https://issues.apache.org/jira/browse/SPARK-19926
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Assignee: Xianjin YE
>Priority: Minor
>  Labels: bulk-closed
>
> Exceptions in PySpark are a little difficult to read, for example:
> {code}
> Traceback (most recent call last):
>   File "", line 5, in 
>   File "/root/dev/spark/dist/python/pyspark/sql/streaming.py", line 853, in 
> start
> return self._sq(self._jwrite.start())
>   File 
> "/root/dev/spark/dist/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
> line 1133, in __call__
>   File "/root/dev/spark/dist/python/pyspark/sql/utils.py", line 69, in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u'Append output mode not supported when 
> there are streaming aggregations on streaming DataFrames/DataSets without 
> watermark;;\nAggregate [window#17, word#5], [window#17 AS window#11, word#5, 
> count(1) AS count#16L]\n+- Filter ((t#6 >= window#17.start) && (t#6 < 
> window#17.end))\n   +- Expand [ArrayBuffer(named_struct(start, 
> CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as 
> double))) + cast(0 as bigint)) - cast(1 as bigint)) * 3000) + 0), end, 
> (CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as 
> double))) + cast(0 as bigint)) - cast(1 as bigint)) * 3000) + 0) + 
> 3000)), word#5, t#6-T3ms), ArrayBuffer(named_struct(start, 
> CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as 
> double))) + cast(1 as bigint)) - cast(1 as bigint)) * 3000) + 0), end, 
> (CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as 
> double))) + cast(1 as bigint)) - cast(1 as bigint)) * 3000) + 0) + 
> 3000)), word#5, t#6-T3ms)], [window#17, word#5, t#6-T3ms]\n  
> +- EventTimeWatermark t#6: timestamp, interval 30 seconds\n +- 
> Project [cast(word#0 as string) AS word#5, cast(t#1 as timestamp) AS t#6]\n   
>  +- StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@c4079ca,csv,List(),Some(StructType(StructField(word,StringType,true),
>  StructField(t,IntegerType,true))),List(),None,Map(sep -> ;, path -> 
> /tmp/data),None), FileSource[/tmp/data], [word#0, t#1]\n'
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19926) Make pyspark exception more readable

2019-09-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-19926:
-
Labels:   (was: bulk-closed)

> Make pyspark exception more readable
> 
>
> Key: SPARK-19926
> URL: https://issues.apache.org/jira/browse/SPARK-19926
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Genmao Yu
>Assignee: Xianjin YE
>Priority: Minor
>
> Exceptions in PySpark are a little difficult to read, for example:
> {code}
> Traceback (most recent call last):
>   File "", line 5, in 
>   File "/root/dev/spark/dist/python/pyspark/sql/streaming.py", line 853, in 
> start
> return self._sq(self._jwrite.start())
>   File 
> "/root/dev/spark/dist/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
> line 1133, in __call__
>   File "/root/dev/spark/dist/python/pyspark/sql/utils.py", line 69, in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u'Append output mode not supported when 
> there are streaming aggregations on streaming DataFrames/DataSets without 
> watermark;;\nAggregate [window#17, word#5], [window#17 AS window#11, word#5, 
> count(1) AS count#16L]\n+- Filter ((t#6 >= window#17.start) && (t#6 < 
> window#17.end))\n   +- Expand [ArrayBuffer(named_struct(start, 
> CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as 
> double))) + cast(0 as bigint)) - cast(1 as bigint)) * 3000) + 0), end, 
> (CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as 
> double))) + cast(0 as bigint)) - cast(1 as bigint)) * 3000) + 0) + 
> 3000)), word#5, t#6-T3ms), ArrayBuffer(named_struct(start, 
> CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as 
> double))) + cast(1 as bigint)) - cast(1 as bigint)) * 3000) + 0), end, 
> (CEIL((cast((precisetimestamp(t#6) - 0) as double) / cast(3000 as 
> double))) + cast(1 as bigint)) - cast(1 as bigint)) * 3000) + 0) + 
> 3000)), word#5, t#6-T3ms)], [window#17, word#5, t#6-T3ms]\n  
> +- EventTimeWatermark t#6: timestamp, interval 30 seconds\n +- 
> Project [cast(word#0 as string) AS word#5, cast(t#1 as timestamp) AS t#6]\n   
>  +- StreamingRelation 
> DataSource(org.apache.spark.sql.SparkSession@c4079ca,csv,List(),Some(StructType(StructField(word,StringType,true),
>  StructField(t,IntegerType,true))),List(),None,Map(sep -> ;, path -> 
> /tmp/data),None), FileSource[/tmp/data], [word#0, t#1]\n'
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29101) CSV datasource returns incorrect .count() from file with malformed records

2019-09-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29101.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25820
[https://github.com/apache/spark/pull/25820]

> CSV datasource returns incorrect .count() from file with malformed records
> --
>
> Key: SPARK-29101
> URL: https://issues.apache.org/jira/browse/SPARK-29101
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Stuart White
>Assignee: Sandeep Katta
>Priority: Minor
> Fix For: 3.0.0
>
>
> Spark 2.4 introduced a change to the way csv files are read.  See [Upgrading 
> From Spark SQL 2.3 to 
> 2.4|https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html#upgrading-from-spark-sql-23-to-24]
>  for more details.
> In that document, it states: _To restore the previous behavior, set 
> spark.sql.csv.parser.columnPruning.enabled to false._
> I am configuring Spark 2.4.4 as such, yet I'm still getting results 
> inconsistent with pre-2.4.  For example:
> Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
>  
> With Spark 2.1.1, if I call .count() on a DataFrame created from this file 
> (using option DROPMALFORMED), "3" is returned.
> {noformat}
> (using Spark 2.1.1)
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").count
> 19/09/16 14:28:01 WARN CSVRelation: Dropping malformed line: xxx
> res1: Long = 3
> {noformat}
> With Spark 2.4.4, I set the "spark.sql.csv.parser.columnPruning.enabled" 
> option to false to restore the pre-2.4 behavior for handling malformed 
> records, then call .count() and "4" is returned.
> {noformat}
> (using spark 2.4.4)
> scala> spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").count
> res1: Long = 4
> {noformat}
> So, using the *spark.sql.csv.parser.columnPruning.enabled* option did not 
> actually restore previous behavior.
> How can I, using Spark 2.4+, get a count of the records in a .csv which 
> excludes malformed records?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


