[jira] [Resolved] (SPARK-33477) Hive partition pruning support date type

2020-11-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33477.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30408
[https://github.com/apache/spark/pull/30408]

>  Hive partition pruning support date type
> -
>
> Key: SPARK-33477
> URL: https://issues.apache.org/jira/browse/SPARK-33477
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> Hive partition pruning can support date type:
> https://issues.apache.org/jira/browse/HIVE-5679
> https://github.com/apache/hive/commit/5106bf1c8671740099fca8e1a7d4b37afe97137f
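A minimal sketch of the query pattern this enables, assuming a Hive-backed table partitioned by a DATE column (the table and column names below are illustrative, not from the issue):

{code:scala}
// Minimal sketch, assuming a Hive-backed table partitioned by a DATE column.
spark.sql("CREATE TABLE events (id BIGINT) PARTITIONED BY (dt DATE) STORED AS PARQUET")
spark.sql("INSERT INTO events PARTITION (dt = '2020-11-24') VALUES (1)")

// With metastore partition pruning enabled, a predicate on the DATE partition
// column can be sent to the Hive metastore so that only matching partitions are
// listed, instead of fetching all partitions and filtering on the driver.
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")
spark.sql("SELECT * FROM events WHERE dt = DATE '2020-11-24'").show()
{code}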



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33548) Peak Execution Memory not display on Spark Executor UI intuitively

2020-11-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238525#comment-17238525
 ] 

Apache Spark commented on SPARK-33548:
--

User 'JQ-Cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/30495

> Peak Execution Memory not display on Spark Executor UI intuitively
> --
>
> Key: SPARK-33548
> URL: https://issues.apache.org/jira/browse/SPARK-33548
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0, 3.0.1
>Reporter: xuziqiJS
>Priority: Major
>
> Currently, Peak Execution Memory can only be obtained through the REST API and 
> is not displayed intuitively on the Spark Executor UI, even though Spark users 
> depend on this metric when tuning executor memory. Therefore, it is important 
> to display the peak memory usage on the Spark UI.
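For reference, a minimal sketch of pulling the metric from the REST API today; the host/port are placeholders, and the exact JSON field name for the peak metrics is an assumption, not taken from this issue:

{code:scala}
// Minimal sketch: read executor summaries from the monitoring REST API of a
// running application. The peak values are assumed to be exposed under a
// "peakMemoryMetrics" entry per executor (Spark 3.0+).
import scala.io.Source

val appId = spark.sparkContext.applicationId
val url = s"http://localhost:4040/api/v1/applications/$appId/executors"
val json = Source.fromURL(url).mkString
println(json) // inspect the per-executor "peakMemoryMetrics" section manually
{code}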



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33548) Peak Execution Memory not display on Spark Executor UI intuitively

2020-11-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33548:


Assignee: (was: Apache Spark)

> Peak Execution Memory not display on Spark Executor UI intuitively
> --
>
> Key: SPARK-33548
> URL: https://issues.apache.org/jira/browse/SPARK-33548
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0, 3.0.1
>Reporter: xuziqiJS
>Priority: Major
>
> Currently, Peak Execution Memory can only be obtained through the REST API and 
> is not displayed intuitively on the Spark Executor UI, even though Spark users 
> depend on this metric when tuning executor memory. Therefore, it is important 
> to display the peak memory usage on the Spark UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33548) Peak Execution Memory not display on Spark Executor UI intuitively

2020-11-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33548:


Assignee: Apache Spark

> Peak Execution Memory not display on Spark Executor UI intuitively
> --
>
> Key: SPARK-33548
> URL: https://issues.apache.org/jira/browse/SPARK-33548
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0, 3.0.1
>Reporter: xuziqiJS
>Assignee: Apache Spark
>Priority: Major
>
> Currently, Peak Execution Memory can only be obtained through the REST API and 
> is not displayed intuitively on the Spark Executor UI, even though Spark users 
> depend on this metric when tuning executor memory. Therefore, it is important 
> to display the peak memory usage on the Spark UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33548) Peak Execution Memory not display on Spark Executor UI intuitively

2020-11-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238522#comment-17238522
 ] 

Apache Spark commented on SPARK-33548:
--

User 'JQ-Cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/30495

> Peak Execution Memory not display on Spark Executor UI intuitively
> --
>
> Key: SPARK-33548
> URL: https://issues.apache.org/jira/browse/SPARK-33548
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0, 3.0.1
>Reporter: xuziqiJS
>Priority: Major
>
> Currently, Peak Execution Memory can only be obtained through the REST API and 
> is not displayed intuitively on the Spark Executor UI, even though Spark users 
> depend on this metric when tuning executor memory. Therefore, it is important 
> to display the peak memory usage on the Spark UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31710) Fail casting numeric to timestamp by default

2020-11-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31710:
--
Labels:   (was: correctness)

> Fail casting numeric to timestamp by default
> 
>
> Key: SPARK-31710
> URL: https://issues.apache.org/jira/browse/SPARK-31710
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
> Environment: hdp:2.7.7
> spark:2.4.5
>Reporter: philipse
>Assignee: philipse
>Priority: Major
> Fix For: 3.1.0
>
>
> Hi Team
> Steps to reproduce.
> {code:java}
> create table test(id bigint);
> insert into test select 1586318188000;
> create table test1(id bigint) partitioned by (year string);
> insert overwrite table test1 partition(year) select 234,cast(id as TIMESTAMP) 
> from test;
> {code}
> let's check the result. 
> Case 1:
> *select * from test1;*
> 234 | 52238-06-04 13:06:400.0
> --the result is wrong
> Case 2:
> *select 234,cast(id as TIMESTAMP) from test;*
>  
> java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd 
> hh:mm:ss[.fffffffff]
>  at java.sql.Timestamp.valueOf(Timestamp.java:237)
>  at 
> org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:441)
>  at 
> org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:421)
>  at 
> org.apache.hive.jdbc.HiveBaseResultSet.getString(HiveBaseResultSet.java:530)
>  at org.apache.hive.beeline.Rows$Row.<init>(Rows.java:166)
>  at org.apache.hive.beeline.BufferedRows.<init>(BufferedRows.java:43)
>  at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1756)
>  at org.apache.hive.beeline.Commands.execute(Commands.java:826)
>  at org.apache.hive.beeline.Commands.sql(Commands.java:670)
>  at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:974)
>  at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:810)
>  at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:767)
>  at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:480)
>  at org.apache.hive.beeline.BeeLine.main(BeeLine.java:463)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
>  at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
>  Error: Unrecognized column type:TIMESTAMP_TYPE (state=,code=0)
>  
> I tried Hive; it works well and the conversion is correct:
> {code:java}
> select 234,cast(id as TIMESTAMP) from test;
>  234   2020-04-08 11:56:28
> {code}
> Two questions:
> q1:
> If we forbid this conversion, should we keep all cases consistent?
> q2:
> If we allow the conversion in some cases, should we check the magnitude of the 
> long value? The code always multiplies by a fixed factor to convert seconds to 
> microseconds, no matter how large the input is; if the conversion would produce 
> a timestamp outside the valid range, we could raise an error.
> {code:java}
> // converting seconds to us
> private[this] def longToTimestamp(t: Long): Long = t * 1000000L{code}
>  
> Thanks!
>  
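A minimal sketch of the behaviour described above, using the value from the repro: the cast treats the numeric as seconds since the epoch, so a milliseconds value silently lands in a far-future year instead of failing.

{code:scala}
// Casting a numeric to TIMESTAMP interprets the value as seconds since the epoch.
spark.sql("SELECT CAST(1586318188 AS TIMESTAMP)").show(false)    // ~2020-04-08, as expected
spark.sql("SELECT CAST(1586318188000 AS TIMESTAMP)").show(false) // year 52238, silently wrong
{code}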



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33551) Do not use custom shuffle reader for repartition

2020-11-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238516#comment-17238516
 ] 

Apache Spark commented on SPARK-33551:
--

User 'maryannxue' has created a pull request for this issue:
https://github.com/apache/spark/pull/30494

> Do not use custom shuffle reader for repartition
> 
>
> Key: SPARK-33551
> URL: https://issues.apache.org/jira/browse/SPARK-33551
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Wei Xue
>Priority: Major
>
> We should have a more thorough fix for all sorts of custom shuffle readers 
> when the original query has a repartition shuffle, based on the discussions 
> on the initial PR: [https://github.com/apache/spark/pull/29797].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33551) Do not use custom shuffle reader for repartition

2020-11-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33551:


Assignee: Apache Spark

> Do not use custom shuffle reader for repartition
> 
>
> Key: SPARK-33551
> URL: https://issues.apache.org/jira/browse/SPARK-33551
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Wei Xue
>Assignee: Apache Spark
>Priority: Major
>
> We should have a more thorough fix for all sorts of custom shuffle readers 
> when the original query has a repartition shuffle, based on the discussions 
> on the initial PR: [https://github.com/apache/spark/pull/29797].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33551) Do not use custom shuffle reader for repartition

2020-11-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33551:


Assignee: (was: Apache Spark)

> Do not use custom shuffle reader for repartition
> 
>
> Key: SPARK-33551
> URL: https://issues.apache.org/jira/browse/SPARK-33551
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Wei Xue
>Priority: Major
>
> We should have a more thorough fix for all sorts of custom shuffle readers 
> when the original query has a repartition shuffle, based on the discussions 
> on the initial PR: [https://github.com/apache/spark/pull/29797].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33551) Do not use custom shuffle reader for repartition

2020-11-24 Thread Wei Xue (Jira)
Wei Xue created SPARK-33551:
---

 Summary: Do not use custom shuffle reader for repartition
 Key: SPARK-33551
 URL: https://issues.apache.org/jira/browse/SPARK-33551
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1
Reporter: Wei Xue


We should have a more thorough fix for all sorts of custom shuffle readers when 
the original query has a repartition shuffle, based on the discussions on the 
initial PR: [https://github.com/apache/spark/pull/29797].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33550) Recover hive-service-rpc to built-in Hive version when we upgrade built-in Hive to 3.1.2

2020-11-24 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33550:

Description: 
https://github.com/apache/spark/pull/30478#discussion_r529179587

> Recover hive-service-rpc to built-in Hive version when we upgrade built-in 
> Hive to 3.1.2
> 
>
> Key: SPARK-33550
> URL: https://issues.apache.org/jira/browse/SPARK-33550
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> https://github.com/apache/spark/pull/30478#discussion_r529179587



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33550) Recover hive-service-rpc to built-in Hive version when we upgrade built-in Hive to 3.1.2

2020-11-24 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33550:

Description: 
Recover hive-service-rpc to built-in Hive version when we upgrade built-in Hive 
to 3.1.2. Please see 
https://github.com/apache/spark/pull/30478#discussion_r529179587 for more 
details.

  was:https://github.com/apache/spark/pull/30478#discussion_r529179587


> Recover hive-service-rpc to built-in Hive version when we upgrade built-in 
> Hive to 3.1.2
> 
>
> Key: SPARK-33550
> URL: https://issues.apache.org/jira/browse/SPARK-33550
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Recover hive-service-rpc to built-in Hive version when we upgrade built-in 
> Hive to 3.1.2. Please see 
> https://github.com/apache/spark/pull/30478#discussion_r529179587 for more 
> details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33550) Recover hive-service-rpc to built-in Hive version when we upgrade built-in Hive to 3.1.2

2020-11-24 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33550:
---

 Summary: Recover hive-service-rpc to built-in Hive version when we 
upgrade built-in Hive to 3.1.2
 Key: SPARK-33550
 URL: https://issues.apache.org/jira/browse/SPARK-33550
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31710) Fail casting numeric to timestamp by default

2020-11-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31710:
--
Labels: correctness  (was: )

> Fail casting numeric to timestamp by default
> 
>
> Key: SPARK-31710
> URL: https://issues.apache.org/jira/browse/SPARK-31710
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
> Environment: hdp:2.7.7
> spark:2.4.5
>Reporter: philipse
>Assignee: philipse
>Priority: Major
>  Labels: correctness
> Fix For: 3.1.0
>
>
> Hi Team
> Steps to reproduce.
> {code:java}
> create table test(id bigint);
> insert into test select 1586318188000;
> create table test1(id bigint) partitioned by (year string);
> insert overwrite table test1 partition(year) select 234,cast(id as TIMESTAMP) 
> from test;
> {code}
> let's check the result. 
> Case 1:
> *select * from test1;*
> 234 | 52238-06-04 13:06:400.0
> --the result is wrong
> Case 2:
> *select 234,cast(id as TIMESTAMP) from test;*
>  
> java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd 
> hh:mm:ss[.fffffffff]
>  at java.sql.Timestamp.valueOf(Timestamp.java:237)
>  at 
> org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:441)
>  at 
> org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:421)
>  at 
> org.apache.hive.jdbc.HiveBaseResultSet.getString(HiveBaseResultSet.java:530)
>  at org.apache.hive.beeline.Rows$Row.<init>(Rows.java:166)
>  at org.apache.hive.beeline.BufferedRows.<init>(BufferedRows.java:43)
>  at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1756)
>  at org.apache.hive.beeline.Commands.execute(Commands.java:826)
>  at org.apache.hive.beeline.Commands.sql(Commands.java:670)
>  at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:974)
>  at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:810)
>  at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:767)
>  at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:480)
>  at org.apache.hive.beeline.BeeLine.main(BeeLine.java:463)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
>  at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
>  Error: Unrecognized column type:TIMESTAMP_TYPE (state=,code=0)
>  
> I tried Hive; it works well and the conversion is correct:
> {code:java}
> select 234,cast(id as TIMESTAMP) from test;
>  234   2020-04-08 11:56:28
> {code}
> Two questions:
> q1:
> If we forbid this conversion, should we keep all cases consistent?
> q2:
> If we allow the conversion in some cases, should we check the magnitude of the 
> long value? The code always multiplies by a fixed factor to convert seconds to 
> microseconds, no matter how large the input is; if the conversion would produce 
> a timestamp outside the valid range, we could raise an error.
> {code:java}
> // converting seconds to us
> private[this] def longToTimestamp(t: Long): Long = t * 1000000L{code}
>  
> Thanks!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33549) Remove configuration spark.sql.legacy.allowCastNumericToTimestamp

2020-11-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33549:


Assignee: Gengliang Wang  (was: Apache Spark)

> Remove configuration spark.sql.legacy.allowCastNumericToTimestamp
> -
>
> Key: SPARK-33549
> URL: https://issues.apache.org/jira/browse/SPARK-33549
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
>
> In the current master branch, there is a new configuration 
> `spark.sql.legacy.allowCastNumericToTimestamp` which controls whether casting 
> Numeric types to Timestamp is allowed. The default value is true.
> After https://github.com/apache/spark/pull/30260, the type conversion between 
> Timestamp type and Numeric type is disallowed in ANSI mode. So, we don't need 
> a separate configuration `spark.sql.legacy.allowCastNumericToTimestamp` for 
> disallowing the conversion.
> We should remove the configuration.
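A minimal sketch of what the flag controls today; behaviour as described above, with the exact error wording omitted:

{code:scala}
// With the legacy flag disabled, casting a numeric to TIMESTAMP is rejected at
// analysis time; with it enabled (the default), the cast is allowed.
spark.conf.set("spark.sql.legacy.allowCastNumericToTimestamp", "false")
// spark.sql("SELECT CAST(1586318188 AS TIMESTAMP)")  // fails with AnalysisException

spark.conf.set("spark.sql.legacy.allowCastNumericToTimestamp", "true")
spark.sql("SELECT CAST(1586318188 AS TIMESTAMP)").show()  // allowed
{code}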



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33549) Remove configuration spark.sql.legacy.allowCastNumericToTimestamp

2020-11-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238501#comment-17238501
 ] 

Apache Spark commented on SPARK-33549:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/30493

> Remove configuration spark.sql.legacy.allowCastNumericToTimestamp
> -
>
> Key: SPARK-33549
> URL: https://issues.apache.org/jira/browse/SPARK-33549
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
>
> In the current master branch, there is a new configuration 
> `spark.sql.legacy.allowCastNumericToTimestamp` which controls whether casting 
> Numeric types to Timestamp is allowed. The default value is true.
> After https://github.com/apache/spark/pull/30260, the type conversion between 
> Timestamp type and Numeric type is disallowed in ANSI mode. So, we don't need 
> a separate configuration `spark.sql.legacy.allowCastNumericToTimestamp` for 
> disallowing the conversion.
> We should remove the configuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33549) Remove configuration spark.sql.legacy.allowCastNumericToTimestamp

2020-11-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33549:


Assignee: Apache Spark  (was: Gengliang Wang)

> Remove configuration spark.sql.legacy.allowCastNumericToTimestamp
> -
>
> Key: SPARK-33549
> URL: https://issues.apache.org/jira/browse/SPARK-33549
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Minor
>
> In the current master branch, there is a new configuration 
> `spark.sql.legacy.allowCastNumericToTimestamp` which controls whether casting 
> Numeric types to Timestamp is allowed. The default value is true.
> After https://github.com/apache/spark/pull/30260, the type conversion between 
> Timestamp type and Numeric type is disallowed in ANSI mode. So, we don't need 
> a separate configuration `spark.sql.legacy.allowCastNumericToTimestamp` for 
> disallowing the conversion.
> We should remove the configuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33533) BasicConnectionProvider should consider case-sensitivity for properties.

2020-11-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33533.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30485
[https://github.com/apache/spark/pull/30485]

> BasicConnectionProvider should consider case-sensitivity for properties.
> 
>
> Key: SPARK-33533
> URL: https://issues.apache.org/jira/browse/SPARK-33533
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Critical
> Fix For: 3.1.0
>
>
> After SPARK-32001, BasicConnectionProvider doesn't consider case-sensitivity 
> for properties.
> Because of this issue, OracleIntegrationSuite doesn't pass.
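For context, a minimal sketch of where the case of a property key matters; the connection details and the driver property name below are placeholders, not taken from this issue:

{code:scala}
// JDBC reader options are forwarded through the connection provider to the
// driver. If the provider lower-cases the keys, a case-sensitive driver
// property such as the illustrative one below is no longer recognised.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")   // placeholder URL
  .option("dbtable", "SOME_TABLE")                            // placeholder table
  .option("oracle.jdbc.mapDateToTimestamp", "false")          // case-sensitive driver property
  .load()
{code}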



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33224) Expose watermark information on SS UI

2020-11-24 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-33224:


Assignee: Jungtaek Lim

> Expose watermark information on SS UI
> -
>
> Key: SPARK-33224
> URL: https://issues.apache.org/jira/browse/SPARK-33224
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming, Web UI
>Affects Versions: 3.0.1
>Reporter: Gabor Somogyi
>Assignee: Jungtaek Lim
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33533) BasicConnectionProvider should consider case-sensitivity for properties.

2020-11-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33533:
--
Priority: Critical  (was: Major)

> BasicConnectionProvider should consider case-sensitivity for properties.
> 
>
> Key: SPARK-33533
> URL: https://issues.apache.org/jira/browse/SPARK-33533
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Critical
>
> After SPARK-32001, BasicConnectionProvider doesn't consider case-sensitivity 
> for properties.
> Because of this issue, OracleIntegrationSuite doesn't pass.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33224) Expose watermark information on SS UI

2020-11-24 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-33224.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30427
[https://github.com/apache/spark/pull/30427]

> Expose watermark information on SS UI
> -
>
> Key: SPARK-33224
> URL: https://issues.apache.org/jira/browse/SPARK-33224
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming, Web UI
>Affects Versions: 3.0.1
>Reporter: Gabor Somogyi
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33549) Remove configuration spark.sql.legacy.allowCastNumericToTimestamp

2020-11-24 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-33549:
--

 Summary: Remove configuration 
spark.sql.legacy.allowCastNumericToTimestamp
 Key: SPARK-33549
 URL: https://issues.apache.org/jira/browse/SPARK-33549
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


In the current master branch, there is a new configuration 
`spark.sql.legacy.allowCastNumericToTimestamp` which controls whether casting 
Numeric types to Timestamp is allowed. The default value is true.

After https://github.com/apache/spark/pull/30260, the type conversion between 
Timestamp type and Numeric type is disallowed in ANSI mode. So, we don't need a 
separate configuration `spark.sql.legacy.allowCastNumericToTimestamp` for 
disallowing the conversion.
We should remove the configuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33542) Group exception messages in catalyst/catalog

2020-11-24 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-33542:
-
Summary: Group exception messages in catalyst/catalog  (was: Group 
exceptions in catalyst/catalog)

> Group exception messages in catalyst/catalog
> 
>
> Key: SPARK-33542
> URL: https://issues.apache.org/jira/browse/SPARK-33542
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Allison Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33539) Standardize exception messages in Spark

2020-11-24 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-33539:
-
Description: 
In the SPIP: Standardize Exception Messages in Spark, there are three major 
improvements proposed:
 # Group error messages in dedicated files.
 # Establish an error message guideline for developers.
 # Improve error message quality.

The first step is to centralize error messages for each component into its own 
dedicated file(s). This can help with auditing error messages and subsequent 
tasks to establish a guideline and improve message quality in the future. 

A general rule of thumb for grouping exceptions:
 * AnalysisException => QueryCompilationErrors
 * SparkException, RuntimeException(UnsupportedOperationException, 
IllegalStateException...) => QueryExecutionErrors

Here is an example PR that groups all `AnalysisException`s in Analyzer into 
QueryCompilationErrors: [https://github.com/apache/spark/pull/29497]

Please see the SPIP: 
[https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing]
 for more details.

  was:
In the SPIP: Standardize Exception Messages in Spark, we have proposed three 
major tasks to standardize exception messages in Spark:
 # Group error messages in dedicated files.
 # Establish an error message guideline for developers.
 # Improve error message quality.

The first step is to centralize error messages for each component into its own 
dedicated file(s). This can help with auditing error messages and subsequent 
tasks to establish a guideline and improve message quality in the future. 

A general rule of thumb for grouping exceptions:
 * AnalysisException => QueryCompilationErrors
 * SparkException, RuntimeException(UnsupportedOperationException, 
IllegalStateException...) => QueryExecutionErrors

Here is an example PR that groups all `AnalysisException`s in Analyzer into 
QueryCompilationErrors: [https://github.com/apache/spark/pull/29497]

Please see the SPIP: 
[https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing]
 for more details.


> Standardize exception messages in Spark
> ---
>
> Key: SPARK-33539
> URL: https://issues.apache.org/jira/browse/SPARK-33539
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: Allison Wang
>Priority: Major
>
> In the SPIP: Standardize Exception Messages in Spark, there are three major 
> improvements proposed:
>  # Group error messages in dedicated files.
>  # Establish an error message guideline for developers.
>  # Improve error message quality.
> The first step is to centralize error messages for each component into its 
> own dedicated file(s). This can help with auditing error messages and 
> subsequent tasks to establish a guideline and improve message quality in the 
> future. 
> A general rule of thumb for grouping exceptions:
>  * AnalysisException => QueryCompilationErrors
>  * SparkException, RuntimeException(UnsupportedOperationException, 
> IllegalStateException...) => QueryExecutionErrors
> Here is an example PR that groups all `AnalysisException`s in Analyzer into 
> QueryCompilationErrors: [https://github.com/apache/spark/pull/29497]
> Please see the SPIP: 
> [https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing]
>  for more details.
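A minimal sketch of the grouping idea described above. QueryCompilationErrors is the target object named in the description; the specific error method and its signature are made up for the example:

{code:scala}
import org.apache.spark.sql.AnalysisException

// Error messages are built in one dedicated object per category instead of
// being constructed inline at every call site.
object QueryCompilationErrors {
  def columnNotFoundError(colName: String): AnalysisException =
    new AnalysisException(s"Column '$colName' does not exist")
}

// A call site then throws the grouped error:
// throw QueryCompilationErrors.columnNotFoundError("id")
{code}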



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33541) Group exception messages in catalyst/expressions

2020-11-24 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-33541:
-
Summary: Group exception messages in catalyst/expressions  (was: Group 
exceptions in catalyst/expressions)

> Group exception messages in catalyst/expressions
> 
>
> Key: SPARK-33541
> URL: https://issues.apache.org/jira/browse/SPARK-33541
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Allison Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33071) Join with ambiguous column succeeding but giving wrong output

2020-11-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33071:


Assignee: Apache Spark

> Join with ambiguous column succeeding but giving wrong output
> -
>
> Key: SPARK-33071
> URL: https://issues.apache.org/jira/browse/SPARK-33071
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.1, 3.1.0
>Reporter: George
>Assignee: Apache Spark
>Priority: Critical
>  Labels: correctness
>
> When joining two datasets where one column in each dataset is sourced from 
> the same input dataset, the join successfully runs, but does not select the 
> correct columns, leading to incorrect output.
> Repro using pyspark:
> {code:java}
> sc.version
> import pyspark.sql.functions as F
> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 'units' 
> : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 'sales': 1, 
> 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}]
> input_df = spark.createDataFrame(d)
> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales'))
> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units'))
> df1 = df1.filter(F.col("key") != F.lit("c"))
> df2 = df2.filter(F.col("key") != F.lit("d"))
> ret = df1.join(df2, df1.key == df2.key, "full").select(
> df1["key"].alias("df1_key"),
> df2["key"].alias("df2_key"),
> df1["sales"],
> df2["units"],
> F.coalesce(df1["key"], df2["key"]).alias("key"))
> ret.show()
> ret.explain(){code}
> output for 2.4.4:
> {code:java}
> >>> sc.version
> u'2.4.4'
> >>> import pyspark.sql.functions as F
> >>> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 
> >>> 'units' : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 
> >>> 'sales': 1, 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}]
> >>> input_df = spark.createDataFrame(d)
> >>> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales'))
> >>> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units'))
> >>> df1 = df1.filter(F.col("key") != F.lit("c"))
> >>> df2 = df2.filter(F.col("key") != F.lit("d"))
> >>> ret = df1.join(df2, df1.key == df2.key, "full").select(
> ... df1["key"].alias("df1_key"),
> ... df2["key"].alias("df2_key"),
> ... df1["sales"],
> ... df2["units"],
> ... F.coalesce(df1["key"], df2["key"]).alias("key"))
> 20/10/05 15:46:14 WARN Column: Constructing trivially true equals predicate, 
> 'key#213 = key#213'. Perhaps you need to use aliases.
> >>> ret.show()
> +-------+-------+-----+-----+----+
> |df1_key|df2_key|sales|units| key|
> +-------+-------+-----+-----+----+
> |      d|      d|    3| null|   d|
> |   null|   null| null|    2|null|
> |      b|      b|    5|   10|   b|
> |      a|      a|    3|    6|   a|
> +-------+-------+-----+-----+----+
> >>> ret.explain()
> == Physical Plan ==
> *(5) Project [key#213 AS df1_key#258, key#213 AS df2_key#259, sales#223L, 
> units#230L, coalesce(key#213, key#213) AS key#260]
> +- SortMergeJoin [key#213], [key#237], FullOuter
>:- *(2) Sort [key#213 ASC NULLS FIRST], false, 0
>:  +- *(2) HashAggregate(keys=[key#213], functions=[sum(sales#214L)])
>: +- Exchange hashpartitioning(key#213, 200)
>:+- *(1) HashAggregate(keys=[key#213], 
> functions=[partial_sum(sales#214L)])
>:   +- *(1) Project [key#213, sales#214L]
>:  +- *(1) Filter (isnotnull(key#213) && NOT (key#213 = c))
>: +- Scan ExistingRDD[key#213,sales#214L,units#215L]
>+- *(4) Sort [key#237 ASC NULLS FIRST], false, 0
>   +- *(4) HashAggregate(keys=[key#237], functions=[sum(units#239L)])
>  +- Exchange hashpartitioning(key#237, 200)
> +- *(3) HashAggregate(keys=[key#237], 
> functions=[partial_sum(units#239L)])
>+- *(3) Project [key#237, units#239L]
>   +- *(3) Filter (isnotnull(key#237) && NOT (key#237 = d))
>  +- Scan ExistingRDD[key#237,sales#238L,units#239L]
> {code}
> output for 3.0.1:
> {code:java}
> // code placeholder
> >>> sc.version
> u'3.0.1'
> >>> import pyspark.sql.functions as F
> >>> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 
> >>> 'units' : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 
> >>> 'sales': 1, 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}]
> >>> input_df = spark.createDataFrame(d)
> /usr/local/lib/python2.7/site-packages/pyspark/sql/session.py:381: 
> UserWarning: inferring schema from dict is deprecated,please use 
> pyspark.sql.Row instead
>   warnings.warn("inferring schema from dict is deprecated,"
> >>> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales'))
> >>> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units'))
> >>> df1 = 

[jira] [Commented] (SPARK-33071) Join with ambiguous column succeeding but giving wrong output

2020-11-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238481#comment-17238481
 ] 

Apache Spark commented on SPARK-33071:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/30488

> Join with ambiguous column succeeding but giving wrong output
> -
>
> Key: SPARK-33071
> URL: https://issues.apache.org/jira/browse/SPARK-33071
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.1, 3.1.0
>Reporter: George
>Priority: Critical
>  Labels: correctness
>
> When joining two datasets where one column in each dataset is sourced from 
> the same input dataset, the join successfully runs, but does not select the 
> correct columns, leading to incorrect output.
> Repro using pyspark:
> {code:java}
> sc.version
> import pyspark.sql.functions as F
> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 'units' 
> : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 'sales': 1, 
> 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}]
> input_df = spark.createDataFrame(d)
> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales'))
> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units'))
> df1 = df1.filter(F.col("key") != F.lit("c"))
> df2 = df2.filter(F.col("key") != F.lit("d"))
> ret = df1.join(df2, df1.key == df2.key, "full").select(
> df1["key"].alias("df1_key"),
> df2["key"].alias("df2_key"),
> df1["sales"],
> df2["units"],
> F.coalesce(df1["key"], df2["key"]).alias("key"))
> ret.show()
> ret.explain(){code}
> output for 2.4.4:
> {code:java}
> >>> sc.version
> u'2.4.4'
> >>> import pyspark.sql.functions as F
> >>> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 
> >>> 'units' : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 
> >>> 'sales': 1, 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}]
> >>> input_df = spark.createDataFrame(d)
> >>> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales'))
> >>> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units'))
> >>> df1 = df1.filter(F.col("key") != F.lit("c"))
> >>> df2 = df2.filter(F.col("key") != F.lit("d"))
> >>> ret = df1.join(df2, df1.key == df2.key, "full").select(
> ... df1["key"].alias("df1_key"),
> ... df2["key"].alias("df2_key"),
> ... df1["sales"],
> ... df2["units"],
> ... F.coalesce(df1["key"], df2["key"]).alias("key"))
> 20/10/05 15:46:14 WARN Column: Constructing trivially true equals predicate, 
> 'key#213 = key#213'. Perhaps you need to use aliases.
> >>> ret.show()
> +-------+-------+-----+-----+----+
> |df1_key|df2_key|sales|units| key|
> +-------+-------+-----+-----+----+
> |      d|      d|    3| null|   d|
> |   null|   null| null|    2|null|
> |      b|      b|    5|   10|   b|
> |      a|      a|    3|    6|   a|
> +-------+-------+-----+-----+----+
> >>> ret.explain()
> == Physical Plan ==
> *(5) Project [key#213 AS df1_key#258, key#213 AS df2_key#259, sales#223L, 
> units#230L, coalesce(key#213, key#213) AS key#260]
> +- SortMergeJoin [key#213], [key#237], FullOuter
>:- *(2) Sort [key#213 ASC NULLS FIRST], false, 0
>:  +- *(2) HashAggregate(keys=[key#213], functions=[sum(sales#214L)])
>: +- Exchange hashpartitioning(key#213, 200)
>:+- *(1) HashAggregate(keys=[key#213], 
> functions=[partial_sum(sales#214L)])
>:   +- *(1) Project [key#213, sales#214L]
>:  +- *(1) Filter (isnotnull(key#213) && NOT (key#213 = c))
>: +- Scan ExistingRDD[key#213,sales#214L,units#215L]
>+- *(4) Sort [key#237 ASC NULLS FIRST], false, 0
>   +- *(4) HashAggregate(keys=[key#237], functions=[sum(units#239L)])
>  +- Exchange hashpartitioning(key#237, 200)
> +- *(3) HashAggregate(keys=[key#237], 
> functions=[partial_sum(units#239L)])
>+- *(3) Project [key#237, units#239L]
>   +- *(3) Filter (isnotnull(key#237) && NOT (key#237 = d))
>  +- Scan ExistingRDD[key#237,sales#238L,units#239L]
> {code}
> output for 3.0.1:
> {code:java}
> // code placeholder
> >>> sc.version
> u'3.0.1'
> >>> import pyspark.sql.functions as F
> >>> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 
> >>> 'units' : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 
> >>> 'sales': 1, 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}]
> >>> input_df = spark.createDataFrame(d)
> /usr/local/lib/python2.7/site-packages/pyspark/sql/session.py:381: 
> UserWarning: inferring schema from dict is deprecated,please use 
> pyspark.sql.Row instead
>   warnings.warn("inferring schema from dict is deprecated,"
> >>> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales'))
> >>> df2 = 

[jira] [Assigned] (SPARK-33071) Join with ambiguous column succeeding but giving wrong output

2020-11-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33071:


Assignee: (was: Apache Spark)

> Join with ambiguous column succeeding but giving wrong output
> -
>
> Key: SPARK-33071
> URL: https://issues.apache.org/jira/browse/SPARK-33071
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.1, 3.1.0
>Reporter: George
>Priority: Critical
>  Labels: correctness
>
> When joining two datasets where one column in each dataset is sourced from 
> the same input dataset, the join successfully runs, but does not select the 
> correct columns, leading to incorrect output.
> Repro using pyspark:
> {code:java}
> sc.version
> import pyspark.sql.functions as F
> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 'units' 
> : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 'sales': 1, 
> 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}]
> input_df = spark.createDataFrame(d)
> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales'))
> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units'))
> df1 = df1.filter(F.col("key") != F.lit("c"))
> df2 = df2.filter(F.col("key") != F.lit("d"))
> ret = df1.join(df2, df1.key == df2.key, "full").select(
> df1["key"].alias("df1_key"),
> df2["key"].alias("df2_key"),
> df1["sales"],
> df2["units"],
> F.coalesce(df1["key"], df2["key"]).alias("key"))
> ret.show()
> ret.explain(){code}
> output for 2.4.4:
> {code:java}
> >>> sc.version
> u'2.4.4'
> >>> import pyspark.sql.functions as F
> >>> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 
> >>> 'units' : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 
> >>> 'sales': 1, 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}]
> >>> input_df = spark.createDataFrame(d)
> >>> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales'))
> >>> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units'))
> >>> df1 = df1.filter(F.col("key") != F.lit("c"))
> >>> df2 = df2.filter(F.col("key") != F.lit("d"))
> >>> ret = df1.join(df2, df1.key == df2.key, "full").select(
> ... df1["key"].alias("df1_key"),
> ... df2["key"].alias("df2_key"),
> ... df1["sales"],
> ... df2["units"],
> ... F.coalesce(df1["key"], df2["key"]).alias("key"))
> 20/10/05 15:46:14 WARN Column: Constructing trivially true equals predicate, 
> 'key#213 = key#213'. Perhaps you need to use aliases.
> >>> ret.show()
> +-------+-------+-----+-----+----+
> |df1_key|df2_key|sales|units| key|
> +-------+-------+-----+-----+----+
> |      d|      d|    3| null|   d|
> |   null|   null| null|    2|null|
> |      b|      b|    5|   10|   b|
> |      a|      a|    3|    6|   a|
> +-------+-------+-----+-----+----+
> >>> ret.explain()
> == Physical Plan ==
> *(5) Project [key#213 AS df1_key#258, key#213 AS df2_key#259, sales#223L, 
> units#230L, coalesce(key#213, key#213) AS key#260]
> +- SortMergeJoin [key#213], [key#237], FullOuter
>:- *(2) Sort [key#213 ASC NULLS FIRST], false, 0
>:  +- *(2) HashAggregate(keys=[key#213], functions=[sum(sales#214L)])
>: +- Exchange hashpartitioning(key#213, 200)
>:+- *(1) HashAggregate(keys=[key#213], 
> functions=[partial_sum(sales#214L)])
>:   +- *(1) Project [key#213, sales#214L]
>:  +- *(1) Filter (isnotnull(key#213) && NOT (key#213 = c))
>: +- Scan ExistingRDD[key#213,sales#214L,units#215L]
>+- *(4) Sort [key#237 ASC NULLS FIRST], false, 0
>   +- *(4) HashAggregate(keys=[key#237], functions=[sum(units#239L)])
>  +- Exchange hashpartitioning(key#237, 200)
> +- *(3) HashAggregate(keys=[key#237], 
> functions=[partial_sum(units#239L)])
>+- *(3) Project [key#237, units#239L]
>   +- *(3) Filter (isnotnull(key#237) && NOT (key#237 = d))
>  +- Scan ExistingRDD[key#237,sales#238L,units#239L]
> {code}
> output for 3.0.1:
> {code:java}
> // code placeholder
> >>> sc.version
> u'3.0.1'
> >>> import pyspark.sql.functions as F
> >>> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 
> >>> 'units' : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 
> >>> 'sales': 1, 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}]
> >>> input_df = spark.createDataFrame(d)
> /usr/local/lib/python2.7/site-packages/pyspark/sql/session.py:381: 
> UserWarning: inferring schema from dict is deprecated,please use 
> pyspark.sql.Row instead
>   warnings.warn("inferring schema from dict is deprecated,"
> >>> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales'))
> >>> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units'))
> >>> df1 = df1.filter(F.col("key") != F.lit("c"))
> >>> 

[jira] [Resolved] (SPARK-33543) Migrate SHOW COLUMNS to new resolution framework

2020-11-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33543.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30490
[https://github.com/apache/spark/pull/30490]

> Migrate SHOW COLUMNS to new resolution framework
> 
>
> Key: SPARK-33543
> URL: https://issues.apache.org/jira/browse/SPARK-33543
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
> Fix For: 3.1.0
>
>
> Migrate SHOW COLUMNS to new resolution framework.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33543) Migrate SHOW COLUMNS to new resolution framework

2020-11-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33543:
---

Assignee: Terry Kim

> Migrate SHOW COLUMNS to new resolution framework
> 
>
> Key: SPARK-33543
> URL: https://issues.apache.org/jira/browse/SPARK-33543
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
>
> Migrate SHOW COLUMNS to new resolution framework.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33544) explode should not filter when used with CreateArray

2020-11-24 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238473#comment-17238473
 ] 

L. C. Hsieh commented on SPARK-33544:
-

Thanks [~hyukjin.kwon]. Will help review if [~tgraves] creates a patch.

> explode should not filter when used with CreateArray
> 
>
> Key: SPARK-33544
> URL: https://issues.apache.org/jira/browse/SPARK-33544
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-32295 added an optimization that 
> inserts a filter for not-null and size > 0 when using an inner explode/inline. 
> This is fine in most cases, but the extra filter is not needed when the explode 
> is over a created array that does not use Literals (the Literal case is already 
> handled). In that situation we know the values aren't null and the array has a 
> size; the empty-array case is already handled as well.
> For instance:
> val df = someDF.selectExpr("number", "explode(array(word, col3))")
> In this case we shouldn't insert the extra Filter, and that filter can also get 
> pushed down into, e.g., a Parquet reader. This just causes extra overhead.
>  
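A minimal sketch for checking the plan in this case; someDF and the column names follow the example above and are otherwise arbitrary:

{code:scala}
import spark.implicits._

// Exploding an array built from existing, non-literal columns.
val someDF = Seq((1L, "a", "b")).toDF("number", "word", "col3")
val df = someDF.selectExpr("number", "explode(array(word, col3))")

// The optimized plan shows whether the extra "isnotnull(...) AND size(...) > 0"
// Filter introduced by SPARK-32295 is present for this query.
df.explain(true)
{code}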



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33544) explode should not filter when used with CreateArray

2020-11-24 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238469#comment-17238469
 ] 

Hyukjin Kwon commented on SPARK-33544:
--

cc [~viirya] FYI

> explode should not filter when used with CreateArray
> 
>
> Key: SPARK-33544
> URL: https://issues.apache.org/jira/browse/SPARK-33544
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-32295 added an optimization that 
> inserts a filter for not-null and size > 0 when using an inner explode/inline. 
> This is fine in most cases, but the extra filter is not needed when the explode 
> is over a created array that does not use Literals (the Literal case is already 
> handled). In that situation we know the values aren't null and the array has a 
> size; the empty-array case is already handled as well.
> For instance:
> val df = someDF.selectExpr("number", "explode(array(word, col3))")
> In this case we shouldn't insert the extra Filter, and that filter can also get 
> pushed down into, e.g., a Parquet reader. This just causes extra overhead.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33548) Peak Execution Memory not display on Spark Executor UI intuitively

2020-11-24 Thread xuziqiJS (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238466#comment-17238466
 ] 

xuziqiJS commented on SPARK-33548:
--

I will fix it, please assign the task to me.

> Peak Execution Memory not display on Spark Executor UI intuitively
> --
>
> Key: SPARK-33548
> URL: https://issues.apache.org/jira/browse/SPARK-33548
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0, 3.0.1
>Reporter: xuziqiJS
>Priority: Major
>
> Currently, Peak Execution Memory can only be obtained through the REST API and 
> is not displayed intuitively on the Spark Executor UI, even though Spark users 
> depend on this metric when tuning executor memory. Therefore, it is important 
> to display the peak memory usage on the Spark UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33548) Peak Execution Memory not display on Spark Executor UI intuitively

2020-11-24 Thread xuziqiJS (Jira)
xuziqiJS created SPARK-33548:


 Summary: Peak Execution Memory not display on Spark Executor UI 
intuitively
 Key: SPARK-33548
 URL: https://issues.apache.org/jira/browse/SPARK-33548
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.1, 3.0.0
Reporter: xuziqiJS


Currently, Peak Execution Memory can only be obtained through the REST API and 
is not displayed intuitively on the Spark Executor UI, even though Spark users 
depend on this metric when tuning executor memory. Therefore, it is important 
to display the peak memory usage on the Spark UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33494) Do not use local shuffle reader for repartition

2020-11-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33494.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30432
[https://github.com/apache/spark/pull/30432]

> Do not use local shuffle reader for repartition
> ---
>
> Key: SPARK-33494
> URL: https://issues.apache.org/jira/browse/SPARK-33494
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33547) Doc Type Construct Literal usage

2020-11-24 Thread angerszhu (Jira)
angerszhu created SPARK-33547:
-

 Summary: Doc Type Construct Literal usage
 Key: SPARK-33547
 URL: https://issues.apache.org/jira/browse/SPARK-33547
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.1.0
Reporter: angerszhu


Add documentation about type construct literals to 
[https://spark.apache.org/docs/3.0.1/sql-ref-literals.html]
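For reference, a minimal sketch of the literal forms the new documentation would cover; the values are examples only:

{code:scala}
// "Type construct" literals: a type keyword followed by a value.
spark.sql("SELECT DATE '2020-11-24'").show()
spark.sql("SELECT TIMESTAMP '2020-11-24 12:00:00'").show()
spark.sql("SELECT INTERVAL 1 DAY").show()
{code}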



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33547) Doc Type Construct Literal usage

2020-11-24 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238453#comment-17238453
 ] 

angerszhu commented on SPARK-33547:
---

Working on this

> Doc Type Construct Literal usage
> 
>
> Key: SPARK-33547
> URL: https://issues.apache.org/jira/browse/SPARK-33547
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>
> Add documentation about type construct literals in 
> [https://spark.apache.org/docs/3.0.1/sql-ref-literals.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33546) CREATE TABLE LIKE should resolve hive serde correctly like CREATE TABLE

2020-11-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-33546:

Description: 
Currently there are several inconsistencies (see the sketch after this list):
 # CREATE TABLE LIKE does not validate the user-specified hive serde. e.g., 
STORED AS PARQUET can't be used with ROW FORMAT SERDE.
 # CREATE TABLE LIKE requires STORED AS and ROW FORMAT SERDE to be specified 
together, which is not necessary.
 # CREATE TABLE LIKE does not respect the default hive serde.
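
A rough SQL sketch of the first inconsistency (table names are made up, and the exact 
clauses accepted depend on the Spark version):

{code:scala}
// Plain CREATE TABLE validates the Hive format clauses: mixing a custom SERDE
// with STORED AS PARQUET is rejected during analysis.
spark.sql("""
  CREATE TABLE src (id INT)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  STORED AS PARQUET
""")  // throws an AnalysisException for CREATE TABLE

// CREATE TABLE LIKE currently accepts a similar combination without the same
// validation, which is the behavior this issue aligns with CREATE TABLE.
spark.sql("CREATE TABLE base (id INT) STORED AS PARQUET")
spark.sql("""
  CREATE TABLE copied LIKE base
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  STORED AS PARQUET
""")
{code}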

> CREATE TABLE LIKE should resolve hive serde correctly like CREATE TABLE
> ---
>
> Key: SPARK-33546
> URL: https://issues.apache.org/jira/browse/SPARK-33546
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>
> Currently there are several inconsistencies:
>  # CREATE TABLE LIKE does not validate the user-specified hive serde. e.g., 
> STORED AS PARQUET can't be used with ROW FORMAT SERDE.
>  # CREATE TABLE LIKE requires STORED AS and ROW FORMAT SERDE to be specified 
> together, which is not necessary.
>  # CREATE TABLE LIKE does not respect the default hive serde.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33252) Migration to NumPy documentation style in MLlib (pyspark.mllib.*)

2020-11-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33252.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30413
[https://github.com/apache/spark/pull/30413]

> Migration to NumPy documentation style in MLlib (pyspark.mllib.*)
> -
>
> Key: SPARK-33252
> URL: https://issues.apache.org/jira/browse/SPARK-33252
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.2.0
>
>
>  This JIRA targets migrating to the NumPy documentation style in MLlib 
> (pyspark.mllib.*). Please also see the parent JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33546) CREATE TABLE LIKE should resolve hive serde correctly like CREATE TABLE

2020-11-24 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-33546:
---

 Summary: CREATE TABLE LIKE should resolve hive serde correctly 
like CREATE TABLE
 Key: SPARK-33546
 URL: https://issues.apache.org/jira/browse/SPARK-33546
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 2.4.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33252) Migration to NumPy documentation style in MLlib (pyspark.mllib.*)

2020-11-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33252:


Assignee: Maciej Szymkiewicz

> Migration to NumPy documentation style in MLlib (pyspark.mllib.*)
> -
>
> Key: SPARK-33252
> URL: https://issues.apache.org/jira/browse/SPARK-33252
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
>  This JIRA targets migrating to the NumPy documentation style in MLlib 
> (pyspark.mllib.*). Please also see the parent JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33252) Migration to NumPy documentation style in MLlib (pyspark.mllib.*)

2020-11-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33252:
-
Fix Version/s: (was: 3.2.0)
   3.1.0

> Migration to NumPy documentation style in MLlib (pyspark.mllib.*)
> -
>
> Key: SPARK-33252
> URL: https://issues.apache.org/jira/browse/SPARK-33252
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
>  This JIRA targets migrating to the NumPy documentation style in MLlib 
> (pyspark.mllib.*). Please also see the parent JIRA.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33534) Allow specifying a minimum number of bytes in a split of a file

2020-11-24 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-33534:
-
Component/s: (was: Input/Output)
 SQL

> Allow specifying a minimum number of bytes in a split of a file
> ---
>
> Key: SPARK-33534
> URL: https://issues.apache.org/jira/browse/SPARK-33534
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Niels Basjes
>Priority: Major
>
> *Background*
>  A long time ago I wrote a way of reading a (usually large) Gzipped 
> file that allows better distribution of the load over an Apache 
> Hadoop cluster: [https://github.com/nielsbasjes/splittablegzip]
> It seems people still need this kind of functionality, and it turns out my 
> code works without modification in conjunction with Apache Spark.
>  See for example:
>  - SPARK-29102
>  - [https://stackoverflow.com/q/28127119/877069]
>  - [https://stackoverflow.com/q/27531816/877069]
> So [~nchammas] provided documentation to my project a while ago on how to use 
> it with Spark.
>  [https://github.com/nielsbasjes/splittablegzip/blob/master/README-Spark.md]
> *The problem*
>  Now some people have indicated getting errors from this feature of mine.
> The fact is that this functionality cannot read a split if it is too small (the 
> number of bytes read from disk and the number of bytes coming out of the 
> compression are different). So my code uses the {{io.file.buffer.size}} 
> setting but also has a hard coded lower limit split size of 4 KiB.
> Now the problem I found when looking into the reports I got is that Spark 
> does not have a minimum number of bytes in a split.
> In fact: When I created a test file and then set the 
> {{spark.sql.files.maxPartitionBytes}} to exactly 1 byte less than the size of 
> my test file my library gave the error:
> {{java.lang.IllegalArgumentException: The provided InputSplit (562686;562687] 
> is 1 bytes which is too small. (Minimum is 65536)}}
> I found the code that does this calculation here 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L74
> *Proposed enhancement*
> So what I propose is to have a new setting 
> ({{spark.sql.files.minPartitionBytes}}  ?) that will guarantee that no split 
> of a file is smaller than a configured number of bytes.
> I also propose to have this set to something like 64KiB as a default.
> Having some constraints on the values of 
> {{spark.sql.files.minPartitionBytes}} and possibly in relation with 
> {{spark.sql.files.maxPartitionBytes}} would be fine.
> *Notes*
> Hadoop already has code that does this: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L456
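
A minimal sketch of the proposed lower bound; {{spark.sql.files.minPartitionBytes}} is a 
hypothetical configuration name, and the clamp below only paraphrases where it could plug 
into the value computed by FilePartition.maxSplitBytes:

{code:scala}
// A lower bound applied to the target split size computed by maxSplitBytes;
// handling the final remainder split of each file would need separate care.
def withMinSplitBytes(computedMaxSplitBytes: Long, minPartitionBytes: Long): Long =
  math.max(computedMaxSplitBytes, minPartitionBytes)

// e.g. a 64 KiB floor would prevent the 1-byte target split from the report above:
withMinSplitBytes(1L, 64 * 1024L)  // 65536
{code}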



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33457) Adjust mypy configuration

2020-11-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33457:


Assignee: Maciej Szymkiewicz

> Adjust mypy configuration
> -
>
> Key: SPARK-33457
> URL: https://issues.apache.org/jira/browse/SPARK-33457
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> At the moment, with the exception of type ignores, we use the default MyPy 
> configuration. These already provide decent coverage, but are somewhat less 
> restrictive than the ones used in {{typeshed}} and {{pyspark-stubs}}.
> We should consider at least the following:
> - {{strict_optional}}
> - {{no_implicit_optional}}
> It might also be a good idea to add {{disallow_untyped_defs}}, which will 
> allow us to catch any instances of user-facing code that are missing 
> annotations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33457) Adjust mypy configuration

2020-11-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33457.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30382
[https://github.com/apache/spark/pull/30382]

> Adjust mypy configuration
> -
>
> Key: SPARK-33457
> URL: https://issues.apache.org/jira/browse/SPARK-33457
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
> At the moment, with the exception of type ignores, we use the default MyPy 
> configuration. These already provide decent coverage, but are somewhat less 
> restrictive than the ones used in {{typeshed}} and {{pyspark-stubs}}.
> We should consider at least the following:
> - {{strict_optional}}
> - {{no_implicit_optional}}
> It might also be a good idea to add {{disallow_untyped_defs}}, which will 
> allow us to catch any instances of user-facing code that are missing 
> annotations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19875) Map->filter on many columns gets stuck in constraint inference optimization code

2020-11-24 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238411#comment-17238411
 ] 

Asif edited comment on SPARK-19875 at 11/24/20, 11:43 PM:
--

[~maropu], [~sameerag]  [~jay.pranavamurthi] I have generated a PR for 
SPARK-33152 which fixes the OOM or unreasonable compile time in queries.

The PR is [pr-for-spark-33152|https://github.com/apache/spark/pull/30185]

I cannot get anybody for code review.

The explanation of the logic used is in the PR.

If needed we can go through the code together. This is going to be used by 
workday in production.


was (Author: ashahid7):
[~maropu], [~sameerag]  [~jay.pranavamurthi] I have generated a PR for 
SPARK-33152 which fixes the OOM or unreasonable compile time in queries.

The PR is [pr-for-spark-33152|https://github.com/apache/spark/pull/30185]

I cannot get anybody for code review.

The explanation of the logic used is in the PR

> Map->filter on many columns gets stuck in constraint inference optimization 
> code
> 
>
> Key: SPARK-19875
> URL: https://issues.apache.org/jira/browse/SPARK-19875
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jay Pranavamurthi
>Priority: Major
>  Labels: bulk-closed
> Attachments: TestFilter.scala, test10cols.csv, test50cols.csv
>
>
> The attached code (TestFilter.scala) works with a 10-column csv dataset, but 
> gets stuck with a 50-column csv dataset. Both datasets are attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19875) Map->filter on many columns gets stuck in constraint inference optimization code

2020-11-24 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238411#comment-17238411
 ] 

Asif commented on SPARK-19875:
--

[~maropu] I have generated a PR for SPARK-33152 which fixes the OOM or 
unreasonable compile time in queries.

The PR is [pr-for-spark-33152|https://github.com/apache/spark/pull/30185]

I cannot get anybody for code review.

The explanation of the logic used is in the PR

> Map->filter on many columns gets stuck in constraint inference optimization 
> code
> 
>
> Key: SPARK-19875
> URL: https://issues.apache.org/jira/browse/SPARK-19875
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jay Pranavamurthi
>Priority: Major
>  Labels: bulk-closed
> Attachments: TestFilter.scala, test10cols.csv, test50cols.csv
>
>
> The attached code (TestFilter.scala) works with a 10-column csv dataset, but 
> gets stuck with a 50-column csv dataset. Both datasets are attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19875) Map->filter on many columns gets stuck in constraint inference optimization code

2020-11-24 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238411#comment-17238411
 ] 

Asif edited comment on SPARK-19875 at 11/24/20, 11:42 PM:
--

[~maropu], [~sameerag]  [~jay.pranavamurthi] I have generated a PR for 
SPARK-33152 which fixes the OOM or unreasonable compile time in queries.

The PR is [pr-for-spark-33152|https://github.com/apache/spark/pull/30185]

I cannot get anybody for code review.

The explanation of the logic used is in the PR


was (Author: ashahid7):
[~maropu] I have generated a PR for SPARK-33152 which fixes the OOM or 
unreasonable compile time in queries.

The PR is [pr-for-spark-33152|https://github.com/apache/spark/pull/30185]

I cannot get anybody for code review.

The explanation of the logic used is in the PR

> Map->filter on many columns gets stuck in constraint inference optimization 
> code
> 
>
> Key: SPARK-19875
> URL: https://issues.apache.org/jira/browse/SPARK-19875
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jay Pranavamurthi
>Priority: Major
>  Labels: bulk-closed
> Attachments: TestFilter.scala, test10cols.csv, test50cols.csv
>
>
> The attached code (TestFilter.scala) works with a 10-column csv dataset, but 
> gets stuck with a 50-column csv dataset. Both datasets are attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33287) Expose state custom metrics information on SS UI

2020-11-24 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-33287:


Assignee: Gabor Somogyi

> Expose state custom metrics information on SS UI
> 
>
> Key: SPARK-33287
> URL: https://issues.apache.org/jira/browse/SPARK-33287
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming, Web UI
>Affects Versions: 3.0.1
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
>
> Since not all custom metrics hold useful information, it would be good to add 
> the possibility to exclude some of them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33287) Expose state custom metrics information on SS UI

2020-11-24 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-33287.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30336
[https://github.com/apache/spark/pull/30336]

> Expose state custom metrics information on SS UI
> 
>
> Key: SPARK-33287
> URL: https://issues.apache.org/jira/browse/SPARK-33287
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming, Web UI
>Affects Versions: 3.0.1
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.1.0
>
>
> Since not all custom metrics hold useful information, it would be good to add 
> the possibility to exclude some of them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33545) Support Fallback Storage during Worker decommission

2020-11-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33545:


Assignee: (was: Apache Spark)

> Support Fallback Storage during Worker decommission
> ---
>
> Key: SPARK-33545
> URL: https://issues.apache.org/jira/browse/SPARK-33545
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33545) Support Fallback Storage during Worker decommission

2020-11-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238392#comment-17238392
 ] 

Apache Spark commented on SPARK-33545:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30492

> Support Fallback Storage during Worker decommission
> ---
>
> Key: SPARK-33545
> URL: https://issues.apache.org/jira/browse/SPARK-33545
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33545) Support Fallback Storage during Worker decommission

2020-11-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33545:


Assignee: (was: Apache Spark)

> Support Fallback Storage during Worker decommission
> ---
>
> Key: SPARK-33545
> URL: https://issues.apache.org/jira/browse/SPARK-33545
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33545) Support Fallback Storage during Worker decommission

2020-11-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33545:


Assignee: Apache Spark

> Support Fallback Storage during Worker decommission
> ---
>
> Key: SPARK-33545
> URL: https://issues.apache.org/jira/browse/SPARK-33545
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33545) Support Fallback Storage during Worker decommission

2020-11-24 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-33545:
-

 Summary: Support Fallback Storage during Worker decommission
 Key: SPARK-33545
 URL: https://issues.apache.org/jira/browse/SPARK-33545
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33544) explode should not filter when used with CreateArray

2020-11-24 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238371#comment-17238371
 ] 

Thomas Graves commented on SPARK-33544:
---

I'm working on a patch for this.

> explode should not filter when used with CreateArray
> 
>
> Key: SPARK-33544
> URL: https://issues.apache.org/jira/browse/SPARK-33544
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-32295 added an optimization to 
> insert a filter for not null and size > 0 when using inner explode/inline. 
> This is fine in most cases, but the extra filter is not needed if the explode 
> is over a created array that does not use literals (literals are already 
> handled). In that case you know the values aren't null and the array has a 
> size. The empty-array case is already handled.
> For instance:
> val df = someDF.selectExpr("number", "explode(array(word, col3))")
> So in this case we shouldn't be inserting the extra Filter, and that filter 
> can also get pushed down into, for example, a Parquet reader. This is just 
> causing extra overhead.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33544) explode should not filter when used with CreateArray

2020-11-24 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-33544:
-

 Summary: explode should not filter when used with CreateArray
 Key: SPARK-33544
 URL: https://issues.apache.org/jira/browse/SPARK-33544
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Thomas Graves


https://issues.apache.org/jira/browse/SPARK-32295 added an optimization to insert 
a filter for not null and size > 0 when using inner explode/inline. This is fine 
in most cases, but the extra filter is not needed if the explode is over a created 
array that does not use literals (literals are already handled). In that case you 
know the values aren't null and the array has a size. The empty-array case is 
already handled.

For instance:

val df = someDF.selectExpr("number", "explode(array(word, col3))")

So in this case we shouldn't be inserting the extra Filter, and that filter can 
also get pushed down into, for example, a Parquet reader. This is just causing 
extra overhead.
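
A small reproduction sketch of the scenario above (column names are made up; intended 
for spark-shell, where the optimized plan can be inspected):

{code:scala}
import spark.implicits._

val someDF = Seq((1, "a", "x"), (2, "b", "y")).toDF("number", "word", "col3")
val df = someDF.selectExpr("number", "explode(array(word, col3))")

// Before the fix, the optimized plan may contain the extra not-null / size > 0
// Filter inserted by SPARK-32295, even though array(word, col3) built from
// existing columns is never null and always has two elements.
df.explain(true)
{code}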

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33492) DSv2: Append/Overwrite/ReplaceTable should invalidate cache

2020-11-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238363#comment-17238363
 ] 

Apache Spark commented on SPARK-33492:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/30491

> DSv2: Append/Overwrite/ReplaceTable should invalidate cache
> ---
>
> Key: SPARK-33492
> URL: https://issues.apache.org/jira/browse/SPARK-33492
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.1.0
>
>
> Unlike in DSv1, currently in DSv2 we don't invalidate table caches for 
> operations such as append, overwrite table by expr/partition, replace table, 
> etc. We should fix these so that the behavior is consistent between v1 and v2.
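
A minimal sketch of the expected behavior (testcat is an assumed, already-configured v2 
catalog; table names are made up):

{code:scala}
// Cache a v2 table, append through the DSv2 write path, then read it again.
spark.sql("CACHE TABLE testcat.ns.t")
spark.table("src").writeTo("testcat.ns.t").append()
// With this change the append invalidates the cached data, so the next read
// reflects the newly appended rows, matching the DSv1 behavior.
spark.table("testcat.ns.t").show()
{code}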



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32670) Group exception messages in Catalyst Analyzer in one file

2020-11-24 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-32670:
---

Assignee: Xinyi Yu  (was: Xiao Li)

> Group exception messages in Catalyst Analyzer in one file
> -
>
> Key: SPARK-32670
> URL: https://issues.apache.org/jira/browse/SPARK-32670
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Assignee: Xinyi Yu
>Priority: Minor
> Fix For: 3.1.0
>
>
> For standardization of error messages and its maintenance, we can try to 
> group the exception messages into a single file. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32670) Group exception messages in Catalyst Analyzer in one file

2020-11-24 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-32670:

Parent: SPARK-33539
Issue Type: Sub-task  (was: Improvement)

> Group exception messages in Catalyst Analyzer in one file
> -
>
> Key: SPARK-32670
> URL: https://issues.apache.org/jira/browse/SPARK-32670
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Minor
> Fix For: 3.1.0
>
>
> For standardization of error messages and its maintenance, we can try to 
> group the exception messages into a single file. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33543) Migrate SHOW COLUMNS to new resolution framework

2020-11-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33543:


Assignee: (was: Apache Spark)

> Migrate SHOW COLUMNS to new resolution framework
> 
>
> Key: SPARK-33543
> URL: https://issues.apache.org/jira/browse/SPARK-33543
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Minor
>
> Migrate SHOW COLUMNS to new resolution framework.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33543) Migrate SHOW COLUMNS to new resolution framework

2020-11-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33543:


Assignee: Apache Spark

> Migrate SHOW COLUMNS to new resolution framework
> 
>
> Key: SPARK-33543
> URL: https://issues.apache.org/jira/browse/SPARK-33543
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Minor
>
> Migrate SHOW COLUMNS to new resolution framework.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33543) Migrate SHOW COLUMNS to new resolution framework

2020-11-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238343#comment-17238343
 ] 

Apache Spark commented on SPARK-33543:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/30490

> Migrate SHOW COLUMNS to new resolution framework
> 
>
> Key: SPARK-33543
> URL: https://issues.apache.org/jira/browse/SPARK-33543
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Minor
>
> Migrate SHOW COLUMNS to new resolution framework.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33543) Migrate SHOW COLUMNS to new resolution framework

2020-11-24 Thread Terry Kim (Jira)
Terry Kim created SPARK-33543:
-

 Summary: Migrate SHOW COLUMNS to new resolution framework
 Key: SPARK-33543
 URL: https://issues.apache.org/jira/browse/SPARK-33543
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Terry Kim


Migrate SHOW COLUMNS to new resolution framework.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33542) Group exceptions in catalyst/catalog

2020-11-24 Thread Allison Wang (Jira)
Allison Wang created SPARK-33542:


 Summary: Group exceptions in catalyst/catalog
 Key: SPARK-33542
 URL: https://issues.apache.org/jira/browse/SPARK-33542
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Allison Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33541) Group exceptions in catalyst/expressions

2020-11-24 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-33541:
-
Summary: Group exceptions in catalyst/expressions  (was: Group 
AnalysisException in catalyst/expressions into QueryCompilationErrors)

> Group exceptions in catalyst/expressions
> 
>
> Key: SPARK-33541
> URL: https://issues.apache.org/jira/browse/SPARK-33541
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Allison Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33539) Standardize exception messages in Spark

2020-11-24 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-33539:
-
Description: 
In the SPIP: Standardize Exception Messages in Spark, we have proposed three 
major tasks to standardize exception messages in Spark:
 # Group error messages in dedicated files.
 # Establish an error message guideline for developers.
 # Improve error message quality.

The first step is to centralize error messages for each component into its own 
dedicated file(s). This can help with auditing error messages and subsequent 
tasks to establish a guideline and improve message quality in the future. 

A general rule of thumb for grouping exceptions:
 * AnalysisException => QueryCompilationErrors
 * SparkException, RuntimeException(UnsupportedOperationException, 
IllegalStateException...) => QueryExecutionErrors

Here is an example PR to group all `AnalysisException` in Analyzer into 
QueryCompilationErrors:  [https://github.com/apache/spark/pull/29497] 

Please see the SPIP: 
[https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing]
 for more details.

  was:
In the SPIP: Standardize Exception Messages in Spark, we have proposed three 
major tasks to standardize exception messages in Spark:
 # Group error messages in dedicated files.
 # Establish an error message guideline for developers.
 # Improve error message quality.

The first step is to centralize error messages for each component into its own 
dedicated file(s). This can help with auditing error messages and subsequent 
tasks to establish a guideline and improve message quality in the future. 

Here is an example PR to group all `AnalysisException` in Analyzer into 
QueryCompilationErrors:  [https://github.com/apache/spark/pull/29497] 

Please see the SPIP: 
[https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing]
 for more details.


> Standardize exception messages in Spark
> ---
>
> Key: SPARK-33539
> URL: https://issues.apache.org/jira/browse/SPARK-33539
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: Allison Wang
>Priority: Major
>
> In the SPIP: Standardize Exception Messages in Spark, we have proposed three 
> major tasks to standardize exception messages in Spark:
>  # Group error messages in dedicated files.
>  # Establish an error message guideline for developers.
>  # Improve error message quality.
> The first step is to centralize error messages for each component into its 
> own dedicated file(s). This can help with auditing error messages and 
> subsequent tasks to establish a guideline and improve message quality in the 
> future. 
> A general rule of thumb for grouping exceptions:
>  * AnalysisException => QueryCompilationErrors
>  * SparkException, RuntimeException(UnsupportedOperationException, 
> IllegalStateException...) => QueryExecutionErrors
> Here is an example PR to group all `AnalysisException` in Analyzer into 
> QueryCompilationErrors:  [https://github.com/apache/spark/pull/29497] 
> Please see the SPIP: 
> [https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing]
>  for more details.
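
A minimal sketch of what the grouping rule described above could look like in practice 
(the object name follows the SPIP; the method name and message are made up for 
illustration):

{code:scala}
import org.apache.spark.sql.AnalysisException

// Compilation-time errors live in one dedicated object instead of being
// constructed inline at every call site.
object QueryCompilationErrors {
  def unresolvedColumnError(name: String): Throwable =
    new AnalysisException(s"Cannot resolve column: $name")
}

// A call site then reads: throw QueryCompilationErrors.unresolvedColumnError(colName)
{code}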



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33541) Group AnalysisException in catalyst/expressions into QueryCompilationErrors

2020-11-24 Thread Allison Wang (Jira)
Allison Wang created SPARK-33541:


 Summary: Group AnalysisException in catalyst/expressions into 
QueryCompilationErrors
 Key: SPARK-33541
 URL: https://issues.apache.org/jira/browse/SPARK-33541
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Allison Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24266) Spark client terminates while driver is still running

2020-11-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24266:
--
Fix Version/s: 2.4.8

> Spark client terminates while driver is still running
> -
>
> Key: SPARK-24266
> URL: https://issues.apache.org/jira/browse/SPARK-24266
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.3.0, 3.0.0
>Reporter: Chun Chen
>Assignee: Stijn De Haes
>Priority: Critical
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> {code}
> Warning: Ignoring non-spark config property: Default=system properties 
> included when running spark-submit.
> 18/05/11 14:50:12 WARN Config: Error reading service account token from: 
> [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring.
> 18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: 
> Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf)
> 18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> 18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. 
> Mounting Hadoop specific files
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: N/A
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: N/A
>container images: N/A
>phase: Pending
>status: []
> 18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> default-token-xvjt9
>node name: tbds-100-98-45-69
>start time: 2018-05-11T06:50:17Z
>container images: docker.oa.com:8080/gaia/spark-driver-cos:20180503_9
>phase: Pending
>status: [ContainerStatus(containerID=null, 
> image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, 
> lastState=ContainerState(running=null, terminated=null, waiting=null, 
> additionalProperties={}), name=spark-kubernetes-driver, ready=false, 
> restartCount=0, state=ContainerState(running=null, terminated=null, 
> waiting=ContainerStateWaiting(message=null, reason=PodInitializing, 
> additionalProperties={}), additionalProperties={}), additionalProperties={})]
> 18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to 
> finish...
> 18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
>pod name: spark-64-293-980-1526021412180-driver
>namespace: tione-603074457
>labels: network -> FLOATINGIP, spark-app-selector -> 
> spark-2843da19c690485b93780ad7992a101e, spark-role -> driver
>pod uid: 90558303-54e7-11e8-9e64-525400da65d8
>creation time: 2018-05-11T06:50:17Z
>service account name: default
>volumes: spark-local-dir-0-spark-local, spark-init-properties, 
> download-jars-volume, download-files, spark-init-secret, hadoop-properties, 
> 

[jira] [Updated] (SPARK-33539) Standardize exception messages in Spark

2020-11-24 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-33539:
-
Description: 
In the SPIP: Standardize Exception Messages in Spark, we have proposed three 
major tasks to standardize exception messages in Spark:
 # Group error messages in dedicated files.
 # Establish an error message guideline for developers.
 # Improve error message quality.

The first step is to centralize error messages for each component into its own 
dedicated file(s). This change can help with auditing error messages and 
subsequent tasks to establish a guideline and improve message quality in the 
future. 

Here is an example PR to group all `AnalysisException` in Analyzer into 
QueryCompilationErrors:  [https://github.com/apache/spark/pull/29497] 

Please see the SPIP: 
[https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing]
 for more details.

  was:
In the SPIP: Standardize Exception Messages in Spark, we have proposed three 
major tasks to standardize exception messages in Spark:
 # Group error messages in dedicated files.
 # Establish an error message guideline for developers.
 # Improve error message quality.

The first step is to centralize error messages for each component into its own 
dedicated file(s). This change can help with auditing error messages and 
subsequent tasks to establish a guideline and improve message quality in the 
future. 

Here is an example PR to group all `AnalysisException` in Analyzer into 
QueryCompilationErrors:  [https://github.com/apache/spark/pull/29497] 

Please see the SPIP: 
https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6–hIFOaNUNSlpaOIZs/edit?usp=sharing
 for more details.


> Standardize exception messages in Spark
> ---
>
> Key: SPARK-33539
> URL: https://issues.apache.org/jira/browse/SPARK-33539
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: Allison Wang
>Priority: Major
>
> In the SPIP: Standardize Exception Messages in Spark, we have proposed three 
> major tasks to standardize exception messages in Spark:
>  # Group error messages in dedicated files.
>  # Establish an error message guideline for developers.
>  # Improve error message quality.
> The first step is to centralize error messages for each component into its 
> own dedicated file(s). This change can help with auditing error messages and 
> subsequent tasks to establish a guideline and improve message quality in the 
> future. 
> Here is an example PR to group all `AnalysisException` in Analyzer into 
> QueryCompilationErrors:  [https://github.com/apache/spark/pull/29497] 
> Please see the SPIP: 
> [https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing]
>  for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33539) Standardize exception messages in Spark

2020-11-24 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-33539:
-
Description: 
In the SPIP: Standardize Exception Messages in Spark, we have proposed three 
major tasks to standardize exception messages in Spark:
 # Group error messages in dedicated files.
 # Establish an error message guideline for developers.
 # Improve error message quality.

The first step is to centralize error messages for each component into its own 
dedicated file(s). This can help with auditing error messages and subsequent 
tasks to establish a guideline and improve message quality in the future. 

Here is an example PR to group all `AnalysisException` in Analyzer into 
QueryCompilationErrors:  [https://github.com/apache/spark/pull/29497] 

Please see the SPIP: 
[https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing]
 for more details.

  was:
In the SPIP: Standardize Exception Messages in Spark, we have proposed three 
major tasks to standardize exception messages in Spark:
 # Group error messages in dedicated files.
 # Establish an error message guideline for developers.
 # Improve error message quality.

The first step is to centralize error messages for each component into its own 
dedicated file(s). This change can help with auditing error messages and 
subsequent tasks to establish a guideline and improve message quality in the 
future. 

Here is an example PR to group all `AnalysisException` in Analyzer into 
QueryCompilationErrors:  [https://github.com/apache/spark/pull/29497] 

Please see the SPIP: 
[https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing]
 for more details.


> Standardize exception messages in Spark
> ---
>
> Key: SPARK-33539
> URL: https://issues.apache.org/jira/browse/SPARK-33539
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: Allison Wang
>Priority: Major
>
> In the SPIP: Standardize Exception Messages in Spark, we have proposed three 
> major tasks to standardize exception messages in Spark:
>  # Group error messages in dedicated files.
>  # Establish an error message guideline for developers.
>  # Improve error message quality.
> The first step is to centralize error messages for each component into its 
> own dedicated file(s). This can help with auditing error messages and 
> subsequent tasks to establish a guideline and improve message quality in the 
> future. 
> Here is an example PR to group all `AnalysisException` in Analyzer into 
> QueryCompilationErrors:  [https://github.com/apache/spark/pull/29497] 
> Please see the SPIP: 
> [https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing]
>  for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33540) Subexpression elimination for interpreted predicate

2020-11-24 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-33540:
---

 Summary: Subexpression elimination for interpreted predicate
 Key: SPARK-33540
 URL: https://issues.apache.org/jira/browse/SPARK-33540
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.1.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


We can support subexpression elimination for interpreted predicate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33539) Standardize exception messages in Spark

2020-11-24 Thread Allison Wang (Jira)
Allison Wang created SPARK-33539:


 Summary: Standardize exception messages in Spark
 Key: SPARK-33539
 URL: https://issues.apache.org/jira/browse/SPARK-33539
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.1.0
Reporter: Allison Wang


In the SPIP: Standardize Exception Messages in Spark, we have proposed three 
major tasks to standardize exception messages in Spark:
 # Group error messages in dedicated files.
 # Establish an error message guideline for developers.
 # Improve error message quality.

The first step is to centralize error messages for each component into its own 
dedicated file(s). This change can help with auditing error messages and 
subsequent tasks to establish a guideline and improve message quality in the 
future. 

Here is an example PR to group all `AnalysisException` in Analyzer into 
QueryCompilationErrors:  [https://github.com/apache/spark/pull/29497] 

Please see the SPIP: 
https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6–hIFOaNUNSlpaOIZs/edit?usp=sharing
 for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33535) export LANG to en_US.UTF-8 in jenkins test script

2020-11-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33535.
---
Fix Version/s: 2.4.8
   3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 30487
[https://github.com/apache/spark/pull/30487]

> export LANG to en_US.UTF-8 in jenkins test script
> -
>
> Key: SPARK-33535
> URL: https://issues.apache.org/jira/browse/SPARK-33535
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.1.0, 3.0.2, 2.4.8
>
>
>  
> {code:java}
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V1
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V2
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V3
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V4
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V5
>  get binary type{code}
>  
> The tests above failed on Jenkins and passed on GitHub Actions. The error message is as follows:
>  
>  
> {code:java}
> Error Messageorg.scalatest.exceptions.TestFailedException: "[?](" did not 
> equal "[�]("Stacktracesbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedException: "[?](" did not equal "[�]("
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
>   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26(SparkThriftServerProtocolVersionsSuite.scala:302)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26$adapted(SparkThriftServerProtocolVersionsSuite.scala:300)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.testExecuteStatementWithProtocolVersion(SparkThriftServerProtocolVersionsSuite.scala:68)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$24(SparkThriftServerProtocolVersionsSuite.scala:300)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> {code}
>  
> seems that the "LANG" of some  build machines is not "en_US.UTF-8"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33535) export LANG to en_US.UTF-8 in jenkins test script

2020-11-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33535:
-

Assignee: Yang Jie

> export LANG to en_US.UTF-8 in jenkins test script
> -
>
> Key: SPARK-33535
> URL: https://issues.apache.org/jira/browse/SPARK-33535
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
>  
> {code:java}
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V1
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V2
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V3
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V4
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V5
>  get binary type{code}
>  
> The tests above failed on Jenkins and passed on GitHub Actions. The error message is as follows:
>  
>  
> {code:java}
> Error Messageorg.scalatest.exceptions.TestFailedException: "[?](" did not 
> equal "[�]("Stacktracesbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedException: "[?](" did not equal "[�]("
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
>   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26(SparkThriftServerProtocolVersionsSuite.scala:302)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26$adapted(SparkThriftServerProtocolVersionsSuite.scala:300)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.testExecuteStatementWithProtocolVersion(SparkThriftServerProtocolVersionsSuite.scala:68)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$24(SparkThriftServerProtocolVersionsSuite.scala:300)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> {code}
>  
> seems that the "LANG" of some  build machines is not "en_US.UTF-8"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33531) [SQL] Avoid shuffle when calling CollectLimitExec#executeToIterator

2020-11-24 Thread Mori[A]rty (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mori[A]rty updated SPARK-33531:
---
Description: 
Use a new method, SparkPlan#executeTakeToIterator, to implement 
CollectLimitExec#executeToIterator and avoid the shuffle caused by invoking the 
parent method SparkPlan#executeToIterator.

When running a SparkThriftServer and spark.sql.thriftServer.incrementalCollect 
is enabled, the extra shuffle will lead to a significant performance issue for 
SQL queries terminated with LIMIT.

  was:
CollectLimitExec#executeToIterator should be implemented using 
CollectLimitExec#executeCollect to avoid shuffle caused by invoking parent 
method SparkPlan#executeToIterator.

When running a SparkThriftServer and spark.sql.thriftServer.incrementalCollect 
is enabled, this will lead to a significant performance issue for SQLs 
terminated with LIMIT.


> [SQL] Avoid shuffle when calling CollectLimitExec#executeToIterator
> ---
>
> Key: SPARK-33531
> URL: https://issues.apache.org/jira/browse/SPARK-33531
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.1
>Reporter: Mori[A]rty
>Priority: Major
>
> Use a new method, SparkPlan#executeTakeToIterator, to implement 
> CollectLimitExec#executeToIterator and avoid the shuffle caused by invoking the 
> parent method SparkPlan#executeToIterator.
> When running a SparkThriftServer with 
> spark.sql.thriftServer.incrementalCollect enabled, the extra shuffle leads 
> to a significant performance issue for SQL queries that end with LIMIT.
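
A simplified, hypothetical model of the idea in plain Scala (the names only mirror 
SparkPlan and CollectLimitExec; this is not Spark code): the parent's default 
executeToIterator pays for an extra exchange, while a limit node can override it and 
serve rows from its already-limited collect.

{code:scala}
// Toy stand-ins for SparkPlan and CollectLimitExec; Int rows keep the sketch self-contained.
trait Plan {
  def executeCollect(): Array[Int]
  // Default path: in real Spark this goes through an extra shuffle.
  def executeToIterator(): Iterator[Int] = {
    println("expensive default path (extra shuffle in real Spark)")
    executeCollect().iterator
  }
}

final class CollectLimit(limit: Int, data: Seq[Int]) extends Plan {
  override def executeCollect(): Array[Int] = data.take(limit).toArray
  // The proposed override: reuse the limited collect result, no extra exchange.
  override def executeToIterator(): Iterator[Int] = executeCollect().iterator
}

object CollectLimitDemo extends App {
  println(new CollectLimit(3, 1 to 100).executeToIterator().toList) // List(1, 2, 3)
}
{code}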



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33531) [SQL] Avoid shuffle when calling CollectLimitExec#executeToIterator

2020-11-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33531:


Assignee: (was: Apache Spark)

> [SQL] Avoid shuffle when calling CollectLimitExec#executeToIterator
> ---
>
> Key: SPARK-33531
> URL: https://issues.apache.org/jira/browse/SPARK-33531
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.1
>Reporter: Mori[A]rty
>Priority: Major
>
> CollectLimitExec#executeToIterator should be implemented using 
> CollectLimitExec#executeCollect to avoid the shuffle caused by invoking the 
> parent method SparkPlan#executeToIterator.
> When running a SparkThriftServer with 
> spark.sql.thriftServer.incrementalCollect enabled, this leads to a 
> significant performance issue for SQL queries that end with LIMIT.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33531) [SQL] Avoid shuffle when calling CollectLimitExec#executeToIterator

2020-11-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238257#comment-17238257
 ] 

Apache Spark commented on SPARK-33531:
--

User 'hammertank' has created a pull request for this issue:
https://github.com/apache/spark/pull/30489

> [SQL] Avoid shuffle when calling CollectLimitExec#executeToIterator
> ---
>
> Key: SPARK-33531
> URL: https://issues.apache.org/jira/browse/SPARK-33531
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.1
>Reporter: Mori[A]rty
>Priority: Major
>
> CollectLimitExec#executeToIterator should be implemented using 
> CollectLimitExec#executeCollect to avoid the shuffle caused by invoking the 
> parent method SparkPlan#executeToIterator.
> When running a SparkThriftServer with 
> spark.sql.thriftServer.incrementalCollect enabled, this leads to a 
> significant performance issue for SQL queries that end with LIMIT.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33531) [SQL] Avoid shuffle when calling CollectLimitExec#executeToIterator

2020-11-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33531:


Assignee: Apache Spark

> [SQL] Avoid shuffle when calling CollectLimitExec#executeToIterator
> ---
>
> Key: SPARK-33531
> URL: https://issues.apache.org/jira/browse/SPARK-33531
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.1
>Reporter: Mori[A]rty
>Assignee: Apache Spark
>Priority: Major
>
> CollectLimitExec#executeToIterator should be implemented using 
> CollectLimitExec#executeCollect to avoid the shuffle caused by invoking the 
> parent method SparkPlan#executeToIterator.
> When running a SparkThriftServer with 
> spark.sql.thriftServer.incrementalCollect enabled, this leads to a 
> significant performance issue for SQL queries that end with LIMIT.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32792) Improve in filter pushdown for ParquetFilters

2020-11-24 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32792:

Parent: SPARK-25419
Issue Type: Sub-task  (was: Improvement)

> Improve in filter pushdown for ParquetFilters
> -
>
> Key: SPARK-32792
> URL: https://issues.apache.org/jira/browse/SPARK-32792
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Support pushing down a `GreaterThanOrEqual` filter on the minimum value and a 
> `LessThanOrEqual` filter on the maximum value when the number of IN values exceeds 
> `spark.sql.parquet.pushdown.inFilterThreshold`. For example:
> ```sql
> SELECT * FROM t WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15)
> ```
> We will push down `id >= 1 and id <= 15`.
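
An illustrative sketch of the rewrite described above, in plain Scala (this is not the 
actual ParquetFilters code; the threshold value merely stands in for 
spark.sql.parquet.pushdown.inFilterThreshold):

{code:scala}
object InFilterRewriteDemo extends App {
  val inValues  = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15)
  val threshold = 10 // stands in for spark.sql.parquet.pushdown.inFilterThreshold

  // Once the IN list is larger than the threshold, push a range instead of the IN set.
  val pushed =
    if (inValues.size > threshold) s"id >= ${inValues.min} AND id <= ${inValues.max}"
    else inValues.mkString("id IN (", ", ", ")")

  println(pushed) // id >= 1 AND id <= 15
}
{code}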



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33538) Directly push IN predicates to the Hive Metastore

2020-11-24 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33538:
---

 Summary: Directly push IN predicates to the Hive Metastore
 Key: SPARK-33538
 URL: https://issues.apache.org/jira/browse/SPARK-33538
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang


Hive 2.0 supports pushing IN predicates directly to the Hive Metastore. Please see 
https://issues.apache.org/jira/browse/HIVE-11726 for more detail.

We should use this API to improve performance.
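
A hypothetical sketch of what the change amounts to. The exact partition-filter grammar 
the Metastore accepts after HIVE-11726 is an assumption here; the point is only that a 
single IN clause can replace a chain of OR'ed equality predicates:

{code:scala}
object MetastoreInFilterDemo extends App {
  val column = "dt"
  val values = Seq("2020-11-22", "2020-11-23", "2020-11-24")

  // What can already be sent today: IN expanded into OR'ed equalities.
  val orExpanded = values.map(v => s"$column = \"$v\"").mkString(" or ")

  // What this ticket proposes sending instead (assumed syntax): one IN clause.
  val directIn = values.map(v => "\"" + v + "\"").mkString(s"($column) in (", ", ", ")")

  println(orExpanded) // dt = "2020-11-22" or dt = "2020-11-23" or dt = "2020-11-24"
  println(directIn)   // (dt) in ("2020-11-22", "2020-11-23", "2020-11-24")
}
{code}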



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33477) Hive partition pruning support date type

2020-11-24 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33477:

Parent: SPARK-33537
Issue Type: Sub-task  (was: Improvement)

>  Hive partition pruning support date type
> -
>
> Key: SPARK-33477
> URL: https://issues.apache.org/jira/browse/SPARK-33477
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> Hive partition pruning can support date type:
> https://issues.apache.org/jira/browse/HIVE-5679
> https://github.com/apache/hive/commit/5106bf1c8671740099fca8e1a7d4b37afe97137f
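
A small, hypothetical usage example (the table and column names are invented, and `spark` 
refers to a Hive-enabled SparkSession as in spark-shell): with date-type support, a filter 
on a DATE partition column can be handed to the Hive Metastore during partition pruning 
instead of being applied after listing every partition.

{code:scala}
// Assumes a Hive-enabled SparkSession bound to the name `spark`.
spark.sql(
  "CREATE TABLE IF NOT EXISTS events (id BIGINT) PARTITIONED BY (dt DATE) STORED AS PARQUET")
// The dt = DATE '2020-11-24' predicate is the kind of filter this ticket lets
// Spark push to the Hive Metastore for pruning.
spark.sql("SELECT count(*) FROM events WHERE dt = DATE '2020-11-24'").show()
{code}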



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27421) RuntimeException when querying a view on a partitioned parquet table

2020-11-24 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27421:

Parent: SPARK-33537
Issue Type: Sub-task  (was: Bug)

> RuntimeException when querying a view on a partitioned parquet table
> 
>
> Key: SPARK-27421
> URL: https://issues.apache.org/jira/browse/SPARK-27421
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1
> Environment: Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit 
> Server VM, Java 1.8.0_141)
>Reporter: Eric Maynard
>Assignee: Yuming Wang
>Priority: Minor
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> When running a simple query, I get the following stacktrace:
> {code}
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1268)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1261)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1261)
>  at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957)
>  at 
> org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>  at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>  at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
>  at 

[jira] [Updated] (SPARK-33458) Hive partition pruning support Contains, StartsWith and EndsWith predicate

2020-11-24 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-33458:

Parent: SPARK-33537
Issue Type: Sub-task  (was: Improvement)

> Hive partition pruning support Contains, StartsWith and EndsWith predicate
> --
>
> Key: SPARK-33458
> URL: https://issues.apache.org/jira/browse/SPARK-33458
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.1.0
>
>
> Hive partition pruning can support Contains, StartsWith and EndsWith 
> predicate:
> https://github.com/apache/hive/blob/0c2c8a7f57330880f156466526bc0fdc94681035/metastore/src/test/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java#L1074-L1075
> https://github.com/apache/hive/commit/0c2c8a7f57330880f156466526bc0fdc94681035#diff-b1200d4259fafd48d7bbd0050e89772218813178f68461a2e82551c52319b282
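
A hypothetical illustration of the filter shapes this ticket targets (the table and column 
names are invented; `spark` refers to a Hive-enabled SparkSession, and `region` is assumed 
to be a string partition column):

{code:scala}
// These LIKE patterns correspond to the StartsWith, EndsWith and Contains predicates
// that partition pruning can now send to the Metastore.
spark.sql("SELECT * FROM logs WHERE region LIKE 'eu%'")        // StartsWith
spark.sql("SELECT * FROM logs WHERE region LIKE '%west'")      // EndsWith
spark.sql("SELECT * FROM logs WHERE region LIKE '%central%'")  // Contains
{code}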



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33537) Hive Metastore filter pushdown improvement

2020-11-24 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33537:
---

 Summary: Hive Metastore filter pushdown improvement
 Key: SPARK-33537
 URL: https://issues.apache.org/jira/browse/SPARK-33537
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang
Assignee: Yuming Wang


This umbrella ticket tracks Hive Metastore filter pushdown improvements. It 
includes:
1. Date type pushdown
2. LIKE pushdown
3. InSet pushdown improvement
and other fixes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33531) [SQL] Avoid shuffle when calling CollectLimitExec#executeToIterator

2020-11-24 Thread Mori[A]rty (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mori[A]rty updated SPARK-33531:
---
Description: 
CollectLimitExec#executeToIterator should be implemented using 
CollectLimitExec#executeCollect to avoid the shuffle caused by invoking the 
parent method SparkPlan#executeToIterator.

When running a SparkThriftServer with spark.sql.thriftServer.incrementalCollect 
enabled, this leads to a significant performance issue for SQL queries that end 
with LIMIT.

  was:CollectLimitExec#executeToIterator should be implemented using 
CollectLimitExec#executeCollect to avoid shuffle caused by invoking parent 
method SparkPlan#executeToIterator.


> [SQL] Avoid shuffle when calling CollectLimitExec#executeToIterator
> ---
>
> Key: SPARK-33531
> URL: https://issues.apache.org/jira/browse/SPARK-33531
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.1
>Reporter: Mori[A]rty
>Priority: Major
>
> CollectLimitExec#executeToIterator should be implemented using 
> CollectLimitExec#executeCollect to avoid the shuffle caused by invoking the 
> parent method SparkPlan#executeToIterator.
> When running a SparkThriftServer with 
> spark.sql.thriftServer.incrementalCollect enabled, this leads to a 
> significant performance issue for SQL queries that end with LIMIT.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33536) Incorrect join results when joining twice with the same DF

2020-11-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238174#comment-17238174
 ] 

Apache Spark commented on SPARK-33536:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/30488

> Incorrect join results when joining twice with the same DF
> --
>
> Key: SPARK-33536
> URL: https://issues.apache.org/jira/browse/SPARK-33536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: wuyi
>Priority: Major
>
> {code:java}
> val emp1 = Seq[TestData](
>   TestData(1, "sales"),
>   TestData(2, "personnel"),
>   TestData(3, "develop"),
>   TestData(4, "IT")).toDS()
> val emp2 = Seq[TestData](
>   TestData(1, "sales"),
>   TestData(2, "personnel"),
>   TestData(3, "develop")).toDS()
> val emp3 = emp1.join(emp2, emp1("key") === emp2("key")).select(emp1("*"))
> emp1.join(emp3, emp1.col("key") === emp3.col("key"), 
> "left_outer").select(emp1.col("*"), emp3.col("key").as("e2")).show()
> // wrong result
> +---+-+---+
> |key|value| e2|
> +---+-+---+
> |  1|sales|  1|
> |  2|personnel|  2|
> |  3|  develop|  3|
> |  4|   IT|  4|
> +---+-+---+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33536) Incorrect join results when joining twice with the same DF

2020-11-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33536:


Assignee: (was: Apache Spark)

> Incorrect join results when joining twice with the same DF
> --
>
> Key: SPARK-33536
> URL: https://issues.apache.org/jira/browse/SPARK-33536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: wuyi
>Priority: Major
>
> {code:java}
> val emp1 = Seq[TestData](
>   TestData(1, "sales"),
>   TestData(2, "personnel"),
>   TestData(3, "develop"),
>   TestData(4, "IT")).toDS()
> val emp2 = Seq[TestData](
>   TestData(1, "sales"),
>   TestData(2, "personnel"),
>   TestData(3, "develop")).toDS()
> val emp3 = emp1.join(emp2, emp1("key") === emp2("key")).select(emp1("*"))
> emp1.join(emp3, emp1.col("key") === emp3.col("key"), 
> "left_outer").select(emp1.col("*"), emp3.col("key").as("e2")).show()
> // wrong result
> +---+-+---+
> |key|value| e2|
> +---+-+---+
> |  1|sales|  1|
> |  2|personnel|  2|
> |  3|  develop|  3|
> |  4|   IT|  4|
> +---+-+---+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33536) Incorrect join results when joining twice with the same DF

2020-11-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33536:


Assignee: Apache Spark

> Incorrect join results when joining twice with the same DF
> --
>
> Key: SPARK-33536
> URL: https://issues.apache.org/jira/browse/SPARK-33536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> val emp1 = Seq[TestData](
>   TestData(1, "sales"),
>   TestData(2, "personnel"),
>   TestData(3, "develop"),
>   TestData(4, "IT")).toDS()
> val emp2 = Seq[TestData](
>   TestData(1, "sales"),
>   TestData(2, "personnel"),
>   TestData(3, "develop")).toDS()
> val emp3 = emp1.join(emp2, emp1("key") === emp2("key")).select(emp1("*"))
> emp1.join(emp3, emp1.col("key") === emp3.col("key"), 
> "left_outer").select(emp1.col("*"), emp3.col("key").as("e2")).show()
> // wrong result
> +---+-+---+
> |key|value| e2|
> +---+-+---+
> |  1|sales|  1|
> |  2|personnel|  2|
> |  3|  develop|  3|
> |  4|   IT|  4|
> +---+-+---+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33536) Incorrect join results when joining twice with the same DF

2020-11-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238173#comment-17238173
 ] 

Apache Spark commented on SPARK-33536:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/30488

> Incorrect join results when joining twice with the same DF
> --
>
> Key: SPARK-33536
> URL: https://issues.apache.org/jira/browse/SPARK-33536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: wuyi
>Priority: Major
>
> {code:java}
> val emp1 = Seq[TestData](
>   TestData(1, "sales"),
>   TestData(2, "personnel"),
>   TestData(3, "develop"),
>   TestData(4, "IT")).toDS()
> val emp2 = Seq[TestData](
>   TestData(1, "sales"),
>   TestData(2, "personnel"),
>   TestData(3, "develop")).toDS()
> val emp3 = emp1.join(emp2, emp1("key") === emp2("key")).select(emp1("*"))
> emp1.join(emp3, emp1.col("key") === emp3.col("key"), 
> "left_outer").select(emp1.col("*"), emp3.col("key").as("e2")).show()
> // wrong result
> +---+-+---+
> |key|value| e2|
> +---+-+---+
> |  1|sales|  1|
> |  2|personnel|  2|
> |  3|  develop|  3|
> |  4|   IT|  4|
> +---+-+---+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33535) export LANG to en_US.UTF-8 in jenkins test script

2020-11-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238170#comment-17238170
 ] 

Apache Spark commented on SPARK-33535:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/30487

> export LANG to en_US.UTF-8 in jenkins test script
> -
>
> Key: SPARK-33535
> URL: https://issues.apache.org/jira/browse/SPARK-33535
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
>  
> {code:java}
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V1
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V2
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V3
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V4
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V5
>  get binary type{code}
>  
> These tests failed in Jenkins but passed in GitHub Actions. The error message is as follows:
>  
>  
> {code:java}
> Error Messageorg.scalatest.exceptions.TestFailedException: "[?](" did not 
> equal "[�]("Stacktracesbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedException: "[?](" did not equal "[�]("
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
>   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26(SparkThriftServerProtocolVersionsSuite.scala:302)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26$adapted(SparkThriftServerProtocolVersionsSuite.scala:300)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.testExecuteStatementWithProtocolVersion(SparkThriftServerProtocolVersionsSuite.scala:68)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$24(SparkThriftServerProtocolVersionsSuite.scala:300)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> {code}
>  
> It seems that "LANG" on some of the build machines is not "en_US.UTF-8".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33535) export LANG to en_US.UTF-8 in jenkins test script

2020-11-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33535:


Assignee: (was: Apache Spark)

> export LANG to en_US.UTF-8 in jenkins test script
> -
>
> Key: SPARK-33535
> URL: https://issues.apache.org/jira/browse/SPARK-33535
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
>  
> {code:java}
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V1
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V2
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V3
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V4
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V5
>  get binary type{code}
>  
> These tests failed in Jenkins but passed in GitHub Actions. The error message is as follows:
>  
>  
> {code:java}
> Error Messageorg.scalatest.exceptions.TestFailedException: "[?](" did not 
> equal "[�]("Stacktracesbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedException: "[?](" did not equal "[�]("
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
>   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26(SparkThriftServerProtocolVersionsSuite.scala:302)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26$adapted(SparkThriftServerProtocolVersionsSuite.scala:300)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.testExecuteStatementWithProtocolVersion(SparkThriftServerProtocolVersionsSuite.scala:68)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$24(SparkThriftServerProtocolVersionsSuite.scala:300)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> {code}
>  
> It seems that "LANG" on some of the build machines is not "en_US.UTF-8".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33535) export LANG to en_US.UTF-8 in jenkins test script

2020-11-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33535:


Assignee: Apache Spark

> export LANG to en_US.UTF-8 in jenkins test script
> -
>
> Key: SPARK-33535
> URL: https://issues.apache.org/jira/browse/SPARK-33535
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>
>  
> {code:java}
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V1
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V2
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V3
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V4
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V5
>  get binary type{code}
>  
> These tests failed in Jenkins but passed in GitHub Actions. The error message is as follows:
>  
>  
> {code:java}
> Error Messageorg.scalatest.exceptions.TestFailedException: "[?](" did not 
> equal "[�]("Stacktracesbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedException: "[?](" did not equal "[�]("
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
>   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26(SparkThriftServerProtocolVersionsSuite.scala:302)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26$adapted(SparkThriftServerProtocolVersionsSuite.scala:300)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.testExecuteStatementWithProtocolVersion(SparkThriftServerProtocolVersionsSuite.scala:68)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$24(SparkThriftServerProtocolVersionsSuite.scala:300)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> {code}
>  
> It seems that "LANG" on some of the build machines is not "en_US.UTF-8".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33535) export LANG to en_US.UTF-8 in jenkins test script

2020-11-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238169#comment-17238169
 ] 

Apache Spark commented on SPARK-33535:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/30487

> export LANG to en_US.UTF-8 in jenkins test script
> -
>
> Key: SPARK-33535
> URL: https://issues.apache.org/jira/browse/SPARK-33535
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
>  
> {code:java}
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V1
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V2
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V3
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V4
>  get binary type
>  
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V5
>  get binary type{code}
>  
> These tests failed in Jenkins but passed in GitHub Actions. The error message is as follows:
>  
>  
> {code:java}
> Error Messageorg.scalatest.exceptions.TestFailedException: "[?](" did not 
> equal "[�]("Stacktracesbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedException: "[?](" did not equal "[�]("
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
>   at 
> org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26(SparkThriftServerProtocolVersionsSuite.scala:302)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26$adapted(SparkThriftServerProtocolVersionsSuite.scala:300)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.testExecuteStatementWithProtocolVersion(SparkThriftServerProtocolVersionsSuite.scala:68)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$24(SparkThriftServerProtocolVersionsSuite.scala:300)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> {code}
>  
> It seems that "LANG" on some of the build machines is not "en_US.UTF-8".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33530) Support --archives option natively

2020-11-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238165#comment-17238165
 ] 

Apache Spark commented on SPARK-33530:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/30486

> Support --archives option natively
> --
>
> Key: SPARK-33530
> URL: https://issues.apache.org/jira/browse/SPARK-33530
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, {{spark-submit --archives}} and {{spark.yarn.dist.archives}} 
> configuration are only supported in Yarn modes:
> {code}
> spark-submit --help
> ...
>  Spark on YARN only:
>   --queue QUEUE_NAME  The YARN queue to submit to (Default: 
> "default").
>   --archives ARCHIVES Comma separated list of archives to be 
> extracted into the
>   working directory of each executor.
> {code}
> This is actually critical for PySpark to support shipping other packages 
> together, see also 
> https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html#using-zipped-virtual-environment.
> Due to this missing feature, PySpark cannot use a conda env to ship other 
> packages together.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33530) Support --archives option natively

2020-11-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238167#comment-17238167
 ] 

Apache Spark commented on SPARK-33530:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/30486

> Support --archives option natively
> --
>
> Key: SPARK-33530
> URL: https://issues.apache.org/jira/browse/SPARK-33530
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, {{spark-submit --archives}} and {{spark.yarn.dist.archives}} 
> configuration are only supported in Yarn modes:
> {code}
> spark-submit --help
> ...
>  Spark on YARN only:
>   --queue QUEUE_NAME  The YARN queue to submit to (Default: 
> "default").
>   --archives ARCHIVES Comma separated list of archives to be 
> extracted into the
>   working directory of each executor.
> {code}
> This is actually critical for PySpark to support shipping other packages 
> together, see also 
> https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html#using-zipped-virtual-environment.
> Due to this missing feature, PySpark cannot use a conda env to ship other 
> packages together.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33530) Support --archives option natively

2020-11-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33530:


Assignee: Apache Spark

> Support --archives option natively
> --
>
> Key: SPARK-33530
> URL: https://issues.apache.org/jira/browse/SPARK-33530
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Currently, {{spark-submit --archives}} and {{spark.yarn.dist.archives}} 
> configuration are only supported in Yarn modes:
> {code}
> spark-submit --help
> ...
>  Spark on YARN only:
>   --queue QUEUE_NAME  The YARN queue to submit to (Default: 
> "default").
>   --archives ARCHIVES Comma separated list of archives to be 
> extracted into the
>   working directory of each executor.
> {code}
> This is actually critical for PySpark to support shipping other packages 
> together, see also 
> https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html#using-zipped-virtual-environment.
> Due to this missing feature, PySpark cannot use a conda env to ship other 
> packages together.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33530) Support --archives option natively

2020-11-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33530:


Assignee: (was: Apache Spark)

> Support --archives option natively
> --
>
> Key: SPARK-33530
> URL: https://issues.apache.org/jira/browse/SPARK-33530
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, {{spark-submit --archives}} and {{spark.yarn.dist.archives}} 
> configuration are only supported in Yarn modes:
> {code}
> spark-submit --help
> ...
>  Spark on YARN only:
>   --queue QUEUE_NAME  The YARN queue to submit to (Default: 
> "default").
>   --archives ARCHIVES Comma separated list of archives to be 
> extracted into the
>   working directory of each executor.
> {code}
> This is actually critical for PySpark to support shipping other packages 
> together, see also 
> https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html#using-zipped-virtual-environment.
> Due to this missing feature, PySpark cannot use a conda env to ship other 
> packages together.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33530) Support --archives option natively

2020-11-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238166#comment-17238166
 ] 

Apache Spark commented on SPARK-33530:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/30486

> Support --archives option natively
> --
>
> Key: SPARK-33530
> URL: https://issues.apache.org/jira/browse/SPARK-33530
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Currently, {{spark-submit --archives}} and {{spark.yarn.dist.archives}} 
> configuration are only supported in Yarn modes:
> {code}
> spark-submit --help
> ...
>  Spark on YARN only:
>   --queue QUEUE_NAME  The YARN queue to submit to (Default: 
> "default").
>   --archives ARCHIVES Comma separated list of archives to be 
> extracted into the
>   working directory of each executor.
> {code}
> This is actually critical for PySpark to support shipping other packages 
> together, see also 
> https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html#using-zipped-virtual-environment.
> Due to this missing feature, PySpark cannot use a conda env to ship other 
> packages together.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33536) Incorrect join results when joining twice with the same DF

2020-11-24 Thread wuyi (Jira)
wuyi created SPARK-33536:


 Summary: Incorrect join results when joining twice with the same DF
 Key: SPARK-33536
 URL: https://issues.apache.org/jira/browse/SPARK-33536
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 3.0.0, 3.1.0
Reporter: wuyi


{code:java}
val emp1 = Seq[TestData](
  TestData(1, "sales"),
  TestData(2, "personnel"),
  TestData(3, "develop"),
  TestData(4, "IT")).toDS()
val emp2 = Seq[TestData](
  TestData(1, "sales"),
  TestData(2, "personnel"),
  TestData(3, "develop")).toDS()
val emp3 = emp1.join(emp2, emp1("key") === emp2("key")).select(emp1("*"))
emp1.join(emp3, emp1.col("key") === emp3.col("key"), 
"left_outer").select(emp1.col("*"), emp3.col("key").as("e2")).show()

// wrong result
+---+-+---+
|key|value| e2|
+---+-+---+
|  1|sales|  1|
|  2|personnel|  2|
|  3|  develop|  3|
|  4|   IT|  4|
+---+-+---+

{code}
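
For reference: emp3 is emp1 joined with emp2 on key and therefore only contains keys 1, 2 
and 3, so the left outer join should presumably return e2 = null for the key-4 ("IT") row. 
The output above instead echoes emp1's own key, likely because emp3.col("key") ends up 
resolved against emp1's key column, which is what makes the result incorrect.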



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



  1   2   >