[jira] [Assigned] (SPARK-36776) Partition filter of DataSourceV2ScanRelation can not push down when select none dataSchema from FileScan

2021-09-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36776:


Assignee: (was: Apache Spark)

> Partition filter of DataSourceV2ScanRelation can not push down when select 
> none dataSchema from FileScan
> 
>
> Key: SPARK-36776
> URL: https://issues.apache.org/jira/browse/SPARK-36776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: suheng.cloud
>Priority: Major
>
> In the PruneFileSourcePartitions rule, FileScan::withFilters is called to 
> push down the partition-pruning filter (and this is the only place this function 
> can be called), but it is guarded by the constraint “scan.readDataSchema.nonEmpty” 
>  [source code 
> here|https://github.com/apache/spark/blob/de351e30a90dd988b133b3d00fa6218bfcaba8b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala#L114]
>  We use Spark SQL with a custom catalog and execute a count query like: select 
> count( * ) from catalog.db.tbl where dt=‘0812’ (the same happens in other queries that 
> do not select any column of tbl), in which dt is a partition key.
> In this case scan.readDataSchema is indeed empty, so no partition pruning is 
> performed on the scan, which causes all partitions to be scanned.
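
A minimal reproduction sketch of the scenario above (catalog, table, and column names are placeholders for the custom-catalog setup described in the report, not taken from it verbatim):

{code:java}
// A table partitioned by dt, queried without selecting any data column.
spark.sql("CREATE TABLE catalog.db.tbl (value STRING, dt STRING) USING parquet PARTITIONED BY (dt)")

// Only the partition column dt is referenced, so FileScan.readDataSchema is empty
// and, per this report, the dt filter is not pushed down: every partition is scanned.
spark.sql("SELECT count(*) FROM catalog.db.tbl WHERE dt = '0812'").show()
{code}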






[jira] [Assigned] (SPARK-36776) Partition filter of DataSourceV2ScanRelation can not push down when select none dataSchema from FileScan

2021-09-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36776:


Assignee: Apache Spark

> Partition filter of DataSourceV2ScanRelation can not push down when select 
> none dataSchema from FileScan
> 
>
> Key: SPARK-36776
> URL: https://issues.apache.org/jira/browse/SPARK-36776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: suheng.cloud
>Assignee: Apache Spark
>Priority: Major
>
> In the PruneFileSourcePartitions rule, FileScan::withFilters is called to 
> push down the partition-pruning filter (and this is the only place this function 
> can be called), but it is guarded by the constraint “scan.readDataSchema.nonEmpty” 
>  [source code 
> here|https://github.com/apache/spark/blob/de351e30a90dd988b133b3d00fa6218bfcaba8b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala#L114]
>  We use Spark SQL with a custom catalog and execute a count query like: select 
> count( * ) from catalog.db.tbl where dt=‘0812’ (the same happens in other queries that 
> do not select any column of tbl), in which dt is a partition key.
> In this case scan.readDataSchema is indeed empty, so no partition pruning is 
> performed on the scan, which causes all partitions to be scanned.






[jira] [Commented] (SPARK-36776) Partition filter of DataSourceV2ScanRelation can not push down when select none dataSchema from FileScan

2021-09-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17417015#comment-17417015
 ] 

Apache Spark commented on SPARK-36776:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34037

> Partition filter of DataSourceV2ScanRelation can not push down when select 
> none dataSchema from FileScan
> 
>
> Key: SPARK-36776
> URL: https://issues.apache.org/jira/browse/SPARK-36776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: suheng.cloud
>Priority: Major
>
> In the PruneFileSourcePartitions rule, FileScan::withFilters is called to 
> push down the partition-pruning filter (and this is the only place this function 
> can be called), but it is guarded by the constraint “scan.readDataSchema.nonEmpty” 
>  [source code 
> here|https://github.com/apache/spark/blob/de351e30a90dd988b133b3d00fa6218bfcaba8b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala#L114]
>  We use Spark SQL with a custom catalog and execute a count query like: select 
> count( * ) from catalog.db.tbl where dt=‘0812’ (the same happens in other queries that 
> do not select any column of tbl), in which dt is a partition key.
> In this case scan.readDataSchema is indeed empty, so no partition pruning is 
> performed on the scan, which causes all partitions to be scanned.






[jira] [Commented] (SPARK-36796) Make all unit tests pass on Java 17

2021-09-17 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17417013#comment-17417013
 ] 

Yang Jie commented on SPARK-36796:
--

i'm working on this

> Make all unit tests pass on Java 17
> ---
>
> Key: SPARK-36796
> URL: https://issues.apache.org/jira/browse/SPARK-36796
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Comment Edited] (SPARK-36796) Make all unit tests pass on Java 17

2021-09-17 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17417013#comment-17417013
 ] 

Yang Jie edited comment on SPARK-36796 at 9/18/21, 4:25 AM:


I'm working on this


was (Author: luciferyang):
i'm working on this

> Make all unit tests pass on Java 17
> ---
>
> Key: SPARK-36796
> URL: https://issues.apache.org/jira/browse/SPARK-36796
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Created] (SPARK-36796) Make all unit tests pass on Java 17

2021-09-17 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-36796:
---

 Summary: Make all unit tests pass on Java 17
 Key: SPARK-36796
 URL: https://issues.apache.org/jira/browse/SPARK-36796
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.3.0
Reporter: Yuming Wang









[jira] [Commented] (SPARK-36776) Partition filter of DataSourceV2ScanRelation can not push down when select none dataSchema from FileScan

2021-09-17 Thread suheng.cloud (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416987#comment-17416987
 ] 

suheng.cloud commented on SPARK-36776:
--

Thank you Hyukjin & Huaxin~

> Partition filter of DataSourceV2ScanRelation can not push down when select 
> none dataSchema from FileScan
> 
>
> Key: SPARK-36776
> URL: https://issues.apache.org/jira/browse/SPARK-36776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: suheng.cloud
>Priority: Major
>
> In the PruneFileSourcePartitions rule, FileScan::withFilters is called to 
> push down the partition-pruning filter (and this is the only place this function 
> can be called), but it is guarded by the constraint “scan.readDataSchema.nonEmpty” 
>  [source code 
> here|https://github.com/apache/spark/blob/de351e30a90dd988b133b3d00fa6218bfcaba8b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala#L114]
>  We use Spark SQL with a custom catalog and execute a count query like: select 
> count( * ) from catalog.db.tbl where dt=‘0812’ (the same happens in other queries that 
> do not select any column of tbl), in which dt is a partition key.
> In this case scan.readDataSchema is indeed empty, so no partition pruning is 
> performed on the scan, which causes all partitions to be scanned.






[jira] [Resolved] (SPARK-36762) Fix Series.isin when Series has NaN values

2021-09-17 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-36762.
---
Fix Version/s: 3.2.0
 Assignee: dgd_contributor
   Resolution: Fixed

Issue resolved by pull request 34005
https://github.com/apache/spark/pull/34005

> Fix Series.isin when Series has NaN values
> --
>
> Key: SPARK-36762
> URL: https://issues.apache.org/jira/browse/SPARK-36762
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0, 3.3.0
>Reporter: dgd_contributor
>Assignee: dgd_contributor
>Priority: Major
> Fix For: 3.2.0
>
>
> {code:python}
> >>> pser = pd.Series([None, 5, None, 3, 2, 1, None, 0, 0])
> >>> psser = ps.from_pandas(pser)
> >>> pser.isin([1, 3, 5, None])
> 0    False
> 1     True
> 2    False
> 3     True
> 4    False
> 5     True
> 6    False
> 7    False
> 8    False
> dtype: bool
> >>> psser.isin([1, 3, 5, None])
> 0    None
> 1    True
> 2    None
> 3    True
> 4    None
> 5    True
> 6    None
> 7    None
> 8    None
> dtype: object
> {code}






[jira] [Commented] (SPARK-36776) Partition filter of DataSourceV2ScanRelation can not push down when select none dataSchema from FileScan

2021-09-17 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416944#comment-17416944
 ] 

Huaxin Gao commented on SPARK-36776:


This is fixed in Spark master/3.2 in this PR 
https://github.com/apache/spark/pull/33191. I will open a PR to back port the 
fix in 3.1.

> Partition filter of DataSourceV2ScanRelation can not push down when select 
> none dataSchema from FileScan
> 
>
> Key: SPARK-36776
> URL: https://issues.apache.org/jira/browse/SPARK-36776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: suheng.cloud
>Priority: Major
>
> In the PruneFileSourcePartitions rule, FileScan::withFilters is called to 
> push down the partition-pruning filter (and this is the only place this function 
> can be called), but it is guarded by the constraint “scan.readDataSchema.nonEmpty” 
>  [source code 
> here|https://github.com/apache/spark/blob/de351e30a90dd988b133b3d00fa6218bfcaba8b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala#L114]
>  We use Spark SQL with a custom catalog and execute a count query like: select 
> count( * ) from catalog.db.tbl where dt=‘0812’ (the same happens in other queries that 
> do not select any column of tbl), in which dt is a partition key.
> In this case scan.readDataSchema is indeed empty, so no partition pruning is 
> performed on the scan, which causes all partitions to be scanned.






[jira] [Assigned] (SPARK-36795) Explain Formatted has Duplicated Node IDs with InMemoryRelation Present

2021-09-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36795:


Assignee: (was: Apache Spark)

> Explain Formatted has Duplicated Node IDs with InMemoryRelation Present
> ---
>
> Key: SPARK-36795
> URL: https://issues.apache.org/jira/browse/SPARK-36795
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Michael Chen
>Priority: Major
>
> When a query contains an InMemoryRelation, the output of Explain Formatted 
> will contain duplicate node IDs.
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan (14)
> +- == Final Plan ==
>* BroadcastHashJoin Inner BuildLeft (9)
>:- BroadcastQueryStage (5)
>:  +- BroadcastExchange (4)
>: +- * Filter (3)
>:+- * ColumnarToRow (2)
>:   +- InMemoryTableScan (1)
>: +- InMemoryRelation (2)
>:   +- * ColumnarToRow (4)
>:  +- Scan parquet default.t1 (3)
>+- * Filter (8)
>   +- * ColumnarToRow (7)
>  +- Scan parquet default.t2 (6)
> +- == Initial Plan ==
>BroadcastHashJoin Inner BuildLeft (13)
>:- BroadcastExchange (11)
>:  +- Filter (10)
>: +- InMemoryTableScan (1)
>:   +- InMemoryRelation (2)
>: +- * ColumnarToRow (4)
>:+- Scan parquet default.t1 (3)
>+- Filter (12)
>   +- Scan parquet default.t2 (6)
> {code}
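
A hypothetical reproduction sketch (the table names and the join are placeholders; the essential ingredient is a cached plan, i.e. an InMemoryRelation, inside the query):

{code:java}
// Cache t1 so the physical plan contains an InMemoryRelation,
// then ask for the formatted plan output.
spark.table("t1").cache()
spark.sql("SELECT * FROM t1 JOIN t2 ON t1.id = t2.id").explain("formatted")
// Per this report, node IDs inside the cached subtree collide with IDs of the outer plan.
{code}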






[jira] [Commented] (SPARK-36795) Explain Formatted has Duplicated Node IDs with InMemoryRelation Present

2021-09-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416941#comment-17416941
 ] 

Apache Spark commented on SPARK-36795:
--

User 'ChenMichael' has created a pull request for this issue:
https://github.com/apache/spark/pull/34036

> Explain Formatted has Duplicated Node IDs with InMemoryRelation Present
> ---
>
> Key: SPARK-36795
> URL: https://issues.apache.org/jira/browse/SPARK-36795
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Michael Chen
>Priority: Major
>
> When a query contains an InMemoryRelation, the output of Explain Formatted 
> will contain duplicate node IDs.
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan (14)
> +- == Final Plan ==
>* BroadcastHashJoin Inner BuildLeft (9)
>:- BroadcastQueryStage (5)
>:  +- BroadcastExchange (4)
>: +- * Filter (3)
>:+- * ColumnarToRow (2)
>:   +- InMemoryTableScan (1)
>: +- InMemoryRelation (2)
>:   +- * ColumnarToRow (4)
>:  +- Scan parquet default.t1 (3)
>+- * Filter (8)
>   +- * ColumnarToRow (7)
>  +- Scan parquet default.t2 (6)
> +- == Initial Plan ==
>BroadcastHashJoin Inner BuildLeft (13)
>:- BroadcastExchange (11)
>:  +- Filter (10)
>: +- InMemoryTableScan (1)
>:   +- InMemoryRelation (2)
>: +- * ColumnarToRow (4)
>:+- Scan parquet default.t1 (3)
>+- Filter (12)
>   +- Scan parquet default.t2 (6)
> {code}






[jira] [Assigned] (SPARK-36795) Explain Formatted has Duplicated Node IDs with InMemoryRelation Present

2021-09-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36795:


Assignee: Apache Spark

> Explain Formatted has Duplicated Node IDs with InMemoryRelation Present
> ---
>
> Key: SPARK-36795
> URL: https://issues.apache.org/jira/browse/SPARK-36795
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Michael Chen
>Assignee: Apache Spark
>Priority: Major
>
> When a query contains an InMemoryRelation, the output of Explain Formatted 
> will contain duplicate node IDs.
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan (14)
> +- == Final Plan ==
>* BroadcastHashJoin Inner BuildLeft (9)
>:- BroadcastQueryStage (5)
>:  +- BroadcastExchange (4)
>: +- * Filter (3)
>:+- * ColumnarToRow (2)
>:   +- InMemoryTableScan (1)
>: +- InMemoryRelation (2)
>:   +- * ColumnarToRow (4)
>:  +- Scan parquet default.t1 (3)
>+- * Filter (8)
>   +- * ColumnarToRow (7)
>  +- Scan parquet default.t2 (6)
> +- == Initial Plan ==
>BroadcastHashJoin Inner BuildLeft (13)
>:- BroadcastExchange (11)
>:  +- Filter (10)
>: +- InMemoryTableScan (1)
>:   +- InMemoryRelation (2)
>: +- * ColumnarToRow (4)
>:+- Scan parquet default.t1 (3)
>+- Filter (12)
>   +- Scan parquet default.t2 (6)
> {code}






[jira] [Created] (SPARK-36795) Explain Formatted has Duplicated Node IDs with InMemoryRelation Present

2021-09-17 Thread Michael Chen (Jira)
Michael Chen created SPARK-36795:


 Summary: Explain Formatted has Duplicated Node IDs with 
InMemoryRelation Present
 Key: SPARK-36795
 URL: https://issues.apache.org/jira/browse/SPARK-36795
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.2
Reporter: Michael Chen


When a query contains an InMemoryRelation, the output of Explain Formatted will 
contain duplicate node IDs.


{code:java}
== Physical Plan ==
AdaptiveSparkPlan (14)
+- == Final Plan ==
   * BroadcastHashJoin Inner BuildLeft (9)
   :- BroadcastQueryStage (5)
   :  +- BroadcastExchange (4)
   : +- * Filter (3)
   :+- * ColumnarToRow (2)
   :   +- InMemoryTableScan (1)
   : +- InMemoryRelation (2)
   :   +- * ColumnarToRow (4)
   :  +- Scan parquet default.t1 (3)
   +- * Filter (8)
  +- * ColumnarToRow (7)
 +- Scan parquet default.t2 (6)
+- == Initial Plan ==
   BroadcastHashJoin Inner BuildLeft (13)
   :- BroadcastExchange (11)
   :  +- Filter (10)
   : +- InMemoryTableScan (1)
   :   +- InMemoryRelation (2)
   : +- * ColumnarToRow (4)
   :+- Scan parquet default.t1 (3)
   +- Filter (12)
  +- Scan parquet default.t2 (6)
{code}






[jira] [Assigned] (SPARK-36793) [K8S] Support write container stdout/stderr to file

2021-09-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36793:


Assignee: Apache Spark

> [K8S] Support write container stdout/stderr to file 
> 
>
> Key: SPARK-36793
> URL: https://issues.apache.org/jira/browse/SPARK-36793
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.2
>Reporter: Zhongwei Zhu
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, the executor and driver pods only redirect stdout/stderr. If users want 
> to use a sidecar logging agent to send stdout/stderr to external log storage, the only 
> way is to change entrypoint.sh, which might break compatibility with the 
> community version.
> We should support this feature, and it could be enabled by Spark 
> config. The related Spark configs are:
> |Key|Default|Desc|
> |Spark.kubernetes.logToFile.enabled|false|Whether to write executor/driver 
> stdout/stderr as log file|
> |Spark.kubernetes.logToFile.path|/var/log/spark|The path to write 
> executor/driver stdout/stderr as log file|
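
A usage sketch for the proposal (both keys are only proposed in this ticket and are assumptions, not existing Spark configs):

{code:java}
import org.apache.spark.SparkConf

// Hypothetical settings mirroring the table above.
val conf = new SparkConf()
  .set("spark.kubernetes.logToFile.enabled", "true")        // proposed key
  .set("spark.kubernetes.logToFile.path", "/var/log/spark") // proposed key
{code}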






[jira] [Assigned] (SPARK-36793) [K8S] Support write container stdout/stderr to file

2021-09-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36793:


Assignee: (was: Apache Spark)

> [K8S] Support write container stdout/stderr to file 
> 
>
> Key: SPARK-36793
> URL: https://issues.apache.org/jira/browse/SPARK-36793
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.2
>Reporter: Zhongwei Zhu
>Priority: Minor
>
> Currently, the executor and driver pods only redirect stdout/stderr. If users want 
> to use a sidecar logging agent to send stdout/stderr to external log storage, the only 
> way is to change entrypoint.sh, which might break compatibility with the 
> community version.
> We should support this feature, and it could be enabled by Spark 
> config. The related Spark configs are:
> |Key|Default|Desc|
> |Spark.kubernetes.logToFile.enabled|false|Whether to write executor/driver 
> stdout/stderr as log file|
> |Spark.kubernetes.logToFile.path|/var/log/spark|The path to write 
> executor/driver stdout/stderr as log file|






[jira] [Commented] (SPARK-36793) [K8S] Support write container stdout/stderr to file

2021-09-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416896#comment-17416896
 ] 

Apache Spark commented on SPARK-36793:
--

User 'warrenzhu25' has created a pull request for this issue:
https://github.com/apache/spark/pull/34035

> [K8S] Support write container stdout/stderr to file 
> 
>
> Key: SPARK-36793
> URL: https://issues.apache.org/jira/browse/SPARK-36793
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.2
>Reporter: Zhongwei Zhu
>Priority: Minor
>
> Currently, the executor and driver pods only redirect stdout/stderr. If users want 
> to use a sidecar logging agent to send stdout/stderr to external log storage, the only 
> way is to change entrypoint.sh, which might break compatibility with the 
> community version.
> We should support this feature, and it could be enabled by Spark 
> config. The related Spark configs are:
> |Key|Default|Desc|
> |Spark.kubernetes.logToFile.enabled|false|Whether to write executor/driver 
> stdout/stderr as log file|
> |Spark.kubernetes.logToFile.path|/var/log/spark|The path to write 
> executor/driver stdout/stderr as log file|






[jira] [Assigned] (SPARK-36794) Ignore duplicated join keys when building relation for SEMI/ANTI hash join

2021-09-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36794:


Assignee: (was: Apache Spark)

> Ignore duplicated join keys when building relation for SEMI/ANTI hash join
> --
>
> Key: SPARK-36794
> URL: https://issues.apache.org/jira/browse/SPARK-36794
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Priority: Minor
>
> For a LEFT SEMI or LEFT ANTI hash equi-join without an extra join condition, we 
> only need to keep one row per unique join key inside the hash table 
> (`HashedRelation`) when building it. This can help reduce the 
> size of the join's hash table.
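
A toy illustration of the idea with plain Scala collections (not Spark's HashedRelation API): for a pure existence check, a set of distinct keys carries the same information as a multimap with duplicate keys.

{code:java}
val buildSide  = Seq(("a", 1), ("a", 2), ("b", 3)) // duplicate join key "a"
val keys       = buildSide.map(_._1).toSet         // keep one entry per key
val streamSide = Seq("a", "c")

val leftSemi = streamSide.filter(keys.contains)    // List("a")
val leftAnti = streamSide.filterNot(keys.contains) // List("c")
{code}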






[jira] [Commented] (SPARK-36794) Ignore duplicated join keys when building relation for SEMI/ANTI hash join

2021-09-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416879#comment-17416879
 ] 

Apache Spark commented on SPARK-36794:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/34034

> Ignore duplicated join keys when building relation for SEMI/ANTI hash join
> --
>
> Key: SPARK-36794
> URL: https://issues.apache.org/jira/browse/SPARK-36794
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Priority: Minor
>
> For a LEFT SEMI or LEFT ANTI hash equi-join without an extra join condition, we 
> only need to keep one row per unique join key inside the hash table 
> (`HashedRelation`) when building it. This can help reduce the 
> size of the join's hash table.






[jira] [Assigned] (SPARK-36794) Ignore duplicated join keys when building relation for SEMI/ANTI hash join

2021-09-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36794:


Assignee: Apache Spark

> Ignore duplicated join keys when building relation for SEMI/ANTI hash join
> --
>
> Key: SPARK-36794
> URL: https://issues.apache.org/jira/browse/SPARK-36794
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Assignee: Apache Spark
>Priority: Minor
>
> For a LEFT SEMI or LEFT ANTI hash equi-join without an extra join condition, we 
> only need to keep one row per unique join key inside the hash table 
> (`HashedRelation`) when building it. This can help reduce the 
> size of the join's hash table.






[jira] [Updated] (SPARK-36794) Ignore duplicated join keys when building relation for SEMI/ANTI hash join

2021-09-17 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-36794:
-
Summary: Ignore duplicated join keys when building relation for SEMI/ANTI 
hash join  (was: Ignore duplicated join keys when building relation for 
LEFT/ANTI hash join)

> Ignore duplicated join keys when building relation for SEMI/ANTI hash join
> --
>
> Key: SPARK-36794
> URL: https://issues.apache.org/jira/browse/SPARK-36794
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Priority: Minor
>
> For a LEFT SEMI or LEFT ANTI hash equi-join without an extra join condition, we 
> only need to keep one row per unique join key inside the hash table 
> (`HashedRelation`) when building it. This can help reduce the 
> size of the join's hash table.






[jira] [Created] (SPARK-36794) Ignore duplicated join keys when building relation for LEFT/ANTI hash join

2021-09-17 Thread Cheng Su (Jira)
Cheng Su created SPARK-36794:


 Summary: Ignore duplicated join keys when building relation for 
LEFT/ANTI hash join
 Key: SPARK-36794
 URL: https://issues.apache.org/jira/browse/SPARK-36794
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Cheng Su


For a LEFT SEMI or LEFT ANTI hash equi-join without an extra join condition, we 
only need to keep one row per unique join key inside the hash table 
(`HashedRelation`) when building it. This can help reduce the size 
of the join's hash table.






[jira] [Comment Edited] (SPARK-23153) Support application dependencies in submission client's local file system

2021-09-17 Thread Stavros Kontopoulos (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416846#comment-17416846
 ] 

Stavros Kontopoulos edited comment on SPARK-23153 at 9/17/21, 6:25 PM:
---

[~xuzhoyin] sorry for the late reply, the local scheme in the past meant local 
in the container, had a different meaning 
(https://github.com/apache/spark/pull/21378). So this was intentional. Not sure 
the status now. Btw regarding the S3 prefix, if I remember correctly the idea 
was not to download files from a remote location locally and then store them 
again eg. S3, this was intended for local files only. Feel free to add any 
other capabilities.  


was (Author: skonto):
[~xuzhoyin] sorry for the late reply, the local scheme in the past meant local 
in the container, had a different meaning 
(https://github.com/apache/spark/pull/21378). So this was intentional. Not sure 
the status now. Btw if I remember correctly the idea was not to download files 
from a remote location locally and then store them again eg. S3. 

> Support application dependencies in submission client's local file system
> -
>
> Key: SPARK-23153
> URL: https://issues.apache.org/jira/browse/SPARK-23153
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Assignee: Stavros Kontopoulos
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently local dependencies are not supported with Spark on K8S i.e. if the 
> user has code or dependencies only on the client where they run 
> {{spark-submit}} then the current implementation has no way to make those 
> visible to the Spark application running inside the K8S pods that get 
> launched.  This limits users to only running applications where the code and 
> dependencies are either baked into the Docker images used or where those are 
> available via some external and globally accessible file system, e.g. HDFS, 
> which are not viable options for many users and environments.
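
For later readers: to the best of my recollection the fix shipped in Spark 3.0 as an upload path for client-local dependencies on Kubernetes; the config key below is given from memory and should be treated as an assumption to verify against the current docs.

{code:java}
import org.apache.spark.SparkConf

// Client-local jars/files are uploaded to a location the driver/executor pods can read.
val conf = new SparkConf()
  .set("spark.kubernetes.file.upload.path", "s3a://my-bucket/spark-uploads") // assumed key
{code}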






[jira] [Comment Edited] (SPARK-23153) Support application dependencies in submission client's local file system

2021-09-17 Thread Stavros Kontopoulos (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416846#comment-17416846
 ] 

Stavros Kontopoulos edited comment on SPARK-23153 at 9/17/21, 6:23 PM:
---

[~xuzhoyin] sorry for the late reply, the local scheme in the past meant local 
in the container, had a different meaning 
(https://github.com/apache/spark/pull/21378). So this was intentional. Not sure 
the status now. Btw if I remember correctly the idea was not to download files 
from a remote location locally and then store them again eg. S3. 


was (Author: skonto):
[~xuzhoyin] sorry for the late reply, the local scheme in the past meant local 
in the container, had a different meaning 
(https://github.com/apache/spark/pull/21378). So this was intentional. Not sure 
the status now.

> Support application dependencies in submission client's local file system
> -
>
> Key: SPARK-23153
> URL: https://issues.apache.org/jira/browse/SPARK-23153
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Assignee: Stavros Kontopoulos
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently local dependencies are not supported with Spark on K8S i.e. if the 
> user has code or dependencies only on the client where they run 
> {{spark-submit}} then the current implementation has no way to make those 
> visible to the Spark application running inside the K8S pods that get 
> launched.  This limits users to only running applications where the code and 
> dependencies are either baked into the Docker images used or where those are 
> available via some external and globally accessible file system, e.g. HDFS, 
> which are not viable options for many users and environments.






[jira] [Comment Edited] (SPARK-23153) Support application dependencies in submission client's local file system

2021-09-17 Thread Stavros Kontopoulos (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416846#comment-17416846
 ] 

Stavros Kontopoulos edited comment on SPARK-23153 at 9/17/21, 6:21 PM:
---

[~xuzhoyin] sorry for the late reply, the local scheme in the past meant local 
in the container, had a different meaning 
(https://github.com/apache/spark/pull/21378). So this was intentional. Not sure 
the status now.


was (Author: skonto):
[~xuzhoyin] sorry for the late reply, the local scheme in the past meant local 
in the container, had a different meaning 
(https://github.com/apache/spark/pull/21378). Not sure the status now.

> Support application dependencies in submission client's local file system
> -
>
> Key: SPARK-23153
> URL: https://issues.apache.org/jira/browse/SPARK-23153
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Assignee: Stavros Kontopoulos
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently local dependencies are not supported with Spark on K8S i.e. if the 
> user has code or dependencies only on the client where they run 
> {{spark-submit}} then the current implementation has no way to make those 
> visible to the Spark application running inside the K8S pods that get 
> launched.  This limits users to only running applications where the code and 
> dependencies are either baked into the Docker images used or where those are 
> available via some external and globally accessible file system, e.g. HDFS, 
> which are not viable options for many users and environments.






[jira] [Commented] (SPARK-23153) Support application dependencies in submission client's local file system

2021-09-17 Thread Stavros Kontopoulos (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416846#comment-17416846
 ] 

Stavros Kontopoulos commented on SPARK-23153:
-

[~xuzhoyin] sorry for the late reply, the local scheme in the past meant local 
in the container, had a different meaning 
(https://github.com/apache/spark/pull/21378). Not sure the status now.

> Support application dependencies in submission client's local file system
> -
>
> Key: SPARK-23153
> URL: https://issues.apache.org/jira/browse/SPARK-23153
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Assignee: Stavros Kontopoulos
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently local dependencies are not supported with Spark on K8S i.e. if the 
> user has code or dependencies only on the client where they run 
> {{spark-submit}} then the current implementation has no way to make those 
> visible to the Spark application running inside the K8S pods that get 
> launched.  This limits users to only running applications where the code and 
> dependencies are either baked into the Docker images used or where those are 
> available via some external and globally accessible file system, e.g. HDFS, 
> which are not viable options for many users and environments.






[jira] [Created] (SPARK-36793) [K8S] Support write container stdout/stderr to file

2021-09-17 Thread Zhongwei Zhu (Jira)
Zhongwei Zhu created SPARK-36793:


 Summary: [K8S] Support write container stdout/stderr to file 
 Key: SPARK-36793
 URL: https://issues.apache.org/jira/browse/SPARK-36793
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.1.2
Reporter: Zhongwei Zhu


Currently, the executor and driver pods only redirect stdout/stderr. If users want 
to use a sidecar logging agent to send stdout/stderr to external log storage, the only 
way is to change entrypoint.sh, which might break compatibility with the community 
version.

We should support this feature, and it could be enabled by Spark 
config. The related Spark configs are:
|Key|Default|Desc|
|Spark.kubernetes.logToFile.enabled|false|Whether to write executor/driver 
stdout/stderr as log file|
|Spark.kubernetes.logToFile.path|/var/log/spark|The path to write 
executor/driver stdout/stderr as log file|






[jira] [Assigned] (SPARK-36792) Inset should handle Double.NaN and Float.NaN

2021-09-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36792:


Assignee: Apache Spark

> Inset should handle Double.NaN and Float.NaN
> 
>
> Key: SPARK-36792
> URL: https://issues.apache.org/jira/browse/SPARK-36792
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.2, 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> Inset(Double.NaN, Seq(Double.NaN, 1d)) returns false
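
For context, NaN equality on the JVM differs between primitive comparison and boxed equality, which is exactly the discrepancy a set-based membership check has to handle consistently (plain Scala below, not Spark's Inset code):

{code:java}
Double.NaN == Double.NaN             // false: primitive IEEE-754 comparison
Double.NaN.equals(Double.NaN)        // true: boxed java.lang.Double equality
Set(Double.NaN).contains(Double.NaN) // true: Scala sets use boxed equality/hashCode
{code}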






[jira] [Commented] (SPARK-36792) Inset should handle Double.NaN and Float.NaN

2021-09-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416799#comment-17416799
 ] 

Apache Spark commented on SPARK-36792:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34033

> Inset should handle Double.NaN and Float.NaN
> 
>
> Key: SPARK-36792
> URL: https://issues.apache.org/jira/browse/SPARK-36792
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.2, 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Inset(Double.NaN, Seq(Double.NaN, 1d)) returns false






[jira] [Assigned] (SPARK-36792) Inset should handle Double.NaN and Float.NaN

2021-09-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36792:


Assignee: (was: Apache Spark)

> Inset should handle Double.NaN and Float.NaN
> 
>
> Key: SPARK-36792
> URL: https://issues.apache.org/jira/browse/SPARK-36792
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.2, 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Inset(Double.NaN, Seq(Double.NaN, 1d)) returns false






[jira] [Commented] (SPARK-36673) Incorrect Unions of struct with mismatched field name case

2021-09-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416782#comment-17416782
 ] 

Apache Spark commented on SPARK-36673:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/34032

> Incorrect Unions of struct with mismatched field name case
> --
>
> Key: SPARK-36673
> URL: https://issues.apache.org/jira/browse/SPARK-36673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Shardul Mahadik
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> If a nested field has different casing on the two sides of the union, the 
> resultant schema of the union will contain both fields in its schema
> {code:java}
> scala> val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS 
> INNER")))
> df1: org.apache.spark.sql.DataFrame = [id: bigint, nested: struct<INNER: bigint>]
> val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))
> df2: org.apache.spark.sql.DataFrame = [id: bigint, nested: struct<inner: bigint>]
> scala> df1.union(df2).printSchema
> root
>  |-- id: long (nullable = false)
>  |-- nested: struct (nullable = false)
>  ||-- INNER: long (nullable = false)
>  ||-- inner: long (nullable = false)
>  {code}
> This seems like a bug. I would expect that Spark SQL would either just union 
> by index or, if the user has requested {{unionByName}}, match 
> fields case-insensitively if {{spark.sql.caseSensitive}} is {{false}}.
> However the output data only has one nested column
> {code:java}
> scala> df1.union(df2).show()
> +---+--+
> | id|nested|
> +---+--+
> |  0|   {0}|
> |  1|   {5}|
> |  0|   {0}|
> |  1|   {5}|
> +---+--+
> {code}
> Trying to project fields of {{nested}} throws an error:
> {code:java}
> scala> df1.union(df2).select("nested.*").show()
> java.lang.ArrayIndexOutOfBoundsException: 1
>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:414)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.dataType(complexTypeExtractors.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:192)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:63)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.output(basicLogicalOperators.scala:63)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Union.$anonfun$output$3(basicLogicalOperators.scala:260)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:260)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.outputSet$lzycompute(QueryPlan.scala:49)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.outputSet(QueryPlan.scala:49)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun$apply$8.applyOrElse(Optimizer.scala:747)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun$apply$8.applyOrElse(Optimizer.scala:695)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:171)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:169)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> 

[jira] [Updated] (SPARK-33772) Build and Run Spark on Java 17

2021-09-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33772:
--
Labels: releasenotes  (was: )

> Build and Run Spark on Java 17
> --
>
> Key: SPARK-33772
> URL: https://issues.apache.org/jira/browse/SPARK-33772
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: releasenotes
>
> Apache Spark supports Java 8 and Java 11 (LTS). The next Java LTS version is 
> 17.
> ||Version||Release Date||
> |Java 17 (LTS)|September 2021|
> Apache Spark has a release plan and `Spark 3.2 Code freeze` was July along 
> with the release branch cut.
> - https://spark.apache.org/versioning-policy.html
> Supporting a new Java version is considered a new feature, which we cannot 
> backport.






[jira] [Updated] (SPARK-36772) FinalizeShuffleMerge fails with an exception due to attempt id not matching

2021-09-17 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-36772:
--
Target Version/s: 3.2.0

> FinalizeShuffleMerge fails with an exception due to attempt id not matching
> ---
>
> Key: SPARK-36772
> URL: https://issues.apache.org/jira/browse/SPARK-36772
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Mridul Muralidharan
>Priority: Blocker
>
> As part of the driver's request to the external shuffle service (ESS) to finalize the 
> merge, it also passes its [application attempt 
> id|https://github.com/apache/spark/blob/3f09093a21306b0fbcb132d4c9f285e56ac6b43c/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalBlockStoreClient.java#L180]
>  so that ESS can validate the request is from the correct attempt.
> This attempt id is fetched from the TransportConf passed in when creating the 
> [ExternalBlockStoreClient|https://github.com/apache/spark/blob/67421d80b8935d91b86e8cd3becb211fa2abd54f/core/src/main/scala/org/apache/spark/SparkEnv.scala#L352]
>  - and the transport conf leverages a [cloned 
> copy|https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/core/src/main/scala/org/apache/spark/network/netty/SparkTransportConf.scala#L47]
>  of the SparkConf passed to it.
> Application attempt id is set as part of SparkContext 
> [initialization|https://github.com/apache/spark/blob/67421d80b8935d91b86e8cd3becb211fa2abd54f/core/src/main/scala/org/apache/spark/SparkContext.scala#L586].
> But this happens after driver SparkEnv has [already been 
> created|https://github.com/apache/spark/blob/67421d80b8935d91b86e8cd3becb211fa2abd54f/core/src/main/scala/org/apache/spark/SparkContext.scala#L460].
> Hence the attempt id that ExternalBlockStoreClient uses will always end up 
> being -1, which will not match the attempt id at the ESS (which is based on 
> spark.app.attempt.id), resulting in merge finalization always failing (“ 
> java.lang.IllegalArgumentException: The attempt id -1 in this 
> FinalizeShuffleMerge message does not match with the current attempt id 1 
> stored in shuffle service for application ...”)
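
The core of the problem is a conf snapshot taken before the attempt id is set. A minimal sketch of that ordering issue (illustrative only, using SparkConf directly rather than the actual SparkEnv/SparkTransportConf code path):

{code:java}
import org.apache.spark.SparkConf

val conf = new SparkConf(false)
val snapshot = conf.clone()                // e.g. the cloned conf kept by the transport layer
conf.set("spark.app.attempt.id", "1")      // set later, during SparkContext initialization
snapshot.get("spark.app.attempt.id", "-1") // still "-1": the earlier clone never sees the update
{code}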






[jira] [Commented] (SPARK-36792) Inset should handle Double.NaN and Float.NaN

2021-09-17 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416743#comment-17416743
 ] 

angerszhu commented on SPARK-36792:
---

I will raise a PR soon.

> Inset should handle Double.NaN and Float.NaN
> 
>
> Key: SPARK-36792
> URL: https://issues.apache.org/jira/browse/SPARK-36792
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.2, 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Inset(Double.NaN, Seq(Double.NaN, 1d)) returns false






[jira] [Created] (SPARK-36792) Inset should handle Double.NaN and Float.NaN

2021-09-17 Thread angerszhu (Jira)
angerszhu created SPARK-36792:
-

 Summary: Inset should handle Double.NaN and Float.NaN
 Key: SPARK-36792
 URL: https://issues.apache.org/jira/browse/SPARK-36792
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.2, 3.0.2, 3.2.0
Reporter: angerszhu


Inset(Double.NaN, Seq(Double.NaN, 1d)) returns false






[jira] [Assigned] (SPARK-36663) When the existing field name is a number, an error will be reported when reading the orc file

2021-09-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36663:
---

Assignee: Kousuke Saruta

> When the existing field name is a number, an error will be reported when 
> reading the orc file
> -
>
> Key: SPARK-36663
> URL: https://issues.apache.org/jira/browse/SPARK-36663
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2
>Reporter: mcdull_zhang
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2021-09-03-20-56-28-846.png
>
>
> You can use the following methods to reproduce the problem:
> {quote}val path = "file:///tmp/test_orc"
> spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path)
> spark.read.orc(path)
> {quote}
> The error message is like this:
> {quote}org.apache.spark.sql.catalyst.parser.ParseException:
>  mismatched input '100' expecting {'ADD', 'AFTER'
> == SQL ==
>  struct<100:bigint>
>  ---^^^
> {quote}
> The error is actually issued by this line of code:
> {quote}CatalystSqlParser.parseDataType("100:bigint")
> {quote}
>  
> The specific background is that spark calls the above code in the process of 
> converting the schema of the orc file into the catalyst schema.
> {quote}// code in OrcUtils
>  private def toCatalystSchema(schema: TypeDescription): StructType = {
>    CharVarcharUtils.replaceCharVarcharWithStringInSchema(
>      CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType])
>  }{quote}
> There are two solutions I currently think of:
>  # Modify the syntax analysis of SparkSQL to identify this kind of schema
>  # The TypeDescription.toString method should add the quote symbol to the 
> numeric column name, because the following syntax is supported:
> {quote}CatalystSqlParser.parseDataType("`100`:bigint")
> {quote}
> But currently TypeDescription does not support changing the UNQUOTED_NAMES 
> variable; should we first submit a PR to the ORC project to support making 
> this variable configurable?
> !image-2021-09-03-20-56-28-846.png!
>  
> What do Spark members think about this issue?
>  






[jira] [Resolved] (SPARK-36663) When the existing field name is a number, an error will be reported when reading the orc file

2021-09-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36663.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33915
[https://github.com/apache/spark/pull/33915]

> When the existing field name is a number, an error will be reported when 
> reading the orc file
> -
>
> Key: SPARK-36663
> URL: https://issues.apache.org/jira/browse/SPARK-36663
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2
>Reporter: mcdull_zhang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: image-2021-09-03-20-56-28-846.png
>
>
> You can use the following methods to reproduce the problem:
> {quote}val path = "file:///tmp/test_orc"
> spark.range(1).withColumnRenamed("id", "100").repartition(1).write.orc(path)
> spark.read.orc(path)
> {quote}
> The error message is like this:
> {quote}org.apache.spark.sql.catalyst.parser.ParseException:
>  mismatched input '100' expecting {'ADD', 'AFTER'
> == SQL ==
>  struct<100:bigint>
>  ---^^^
> {quote}
> The error is actually issued by this line of code:
> {quote}CatalystSqlParser.parseDataType("100:bigint")
> {quote}
>  
> The specific background is that spark calls the above code in the process of 
> converting the schema of the orc file into the catalyst schema.
> {quote}// code in OrcUtils
>  private def toCatalystSchema(schema: TypeDescription): StructType = {
>    CharVarcharUtils.replaceCharVarcharWithStringInSchema(
>      CatalystSqlParser.parseDataType(schema.toString).asInstanceOf[StructType])
>  }{quote}
> There are two solutions I currently think of:
>  # Modify the syntax analysis of SparkSQL to identify this kind of schema
>  # The TypeDescription.toString method should add the quote symbol to the 
> numeric column name, because the following syntax is supported:
> {quote}CatalystSqlParser.parseDataType("`100`:bigint")
> {quote}
> But currently TypeDescription does not support changing the UNQUOTED_NAMES 
> variable; should we first submit a PR to the ORC project to support making 
> this variable configurable?
> !image-2021-09-03-20-56-28-846.png!
>  
> What do Spark members think about this issue?
>  






[jira] [Resolved] (SPARK-36767) ArrayMin/ArrayMax/SortArray/ArraySort add comment and UT

2021-09-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36767.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 34008
[https://github.com/apache/spark/pull/34008]

>  ArrayMin/ArrayMax/SortArray/ArraySort add comment and UT
> -
>
> Key: SPARK-36767
> URL: https://issues.apache.org/jira/browse/SPARK-36767
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.1
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>







[jira] [Assigned] (SPARK-36767) ArrayMin/ArrayMax/SortArray/ArraySort add comment and UT

2021-09-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36767:
---

Assignee: angerszhu

>  ArrayMin/ArrayMax/SortArray/ArraySort add comment and UT
> -
>
> Key: SPARK-36767
> URL: https://issues.apache.org/jira/browse/SPARK-36767
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.1
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36673) Incorrect Unions of struct with mismatched field name case

2021-09-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36673:
---

Assignee: L. C. Hsieh

> Incorrect Unions of struct with mismatched field name case
> --
>
> Key: SPARK-36673
> URL: https://issues.apache.org/jira/browse/SPARK-36673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Shardul Mahadik
>Assignee: L. C. Hsieh
>Priority: Major
>
> If a nested field has different casing on the two sides of a union, the 
> resultant schema of the union will contain both fields in its schema.
> {code:java}
> scala> val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS 
> INNER")))
> df1: org.apache.spark.sql.DataFrame = [id: bigint, nested: struct<INNER: bigint>]
> val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))
> df2: org.apache.spark.sql.DataFrame = [id: bigint, nested: struct<inner: bigint>]
> scala> df1.union(df2).printSchema
> root
>  |-- id: long (nullable = false)
>  |-- nested: struct (nullable = false)
>  ||-- INNER: long (nullable = false)
>  ||-- inner: long (nullable = false)
>  {code}
> This seems like a bug. I would expect that Spark SQL would either just union 
> by index or, if the user has requested {{unionByName}}, match fields 
> case-insensitively when {{spark.sql.caseSensitive}} is {{false}}.
> However, the output data has only one nested column:
> {code:java}
> scala> df1.union(df2).show()
> +---+--+
> | id|nested|
> +---+--+
> |  0|   {0}|
> |  1|   {5}|
> |  0|   {0}|
> |  1|   {5}|
> +---+--+
> {code}
> Trying to project fields of {{nested}} throws an error:
> {code:java}
> scala> df1.union(df2).select("nested.*").show()
> java.lang.ArrayIndexOutOfBoundsException: 1
>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:414)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.dataType(complexTypeExtractors.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:192)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:63)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.output(basicLogicalOperators.scala:63)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Union.$anonfun$output$3(basicLogicalOperators.scala:260)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:260)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.outputSet$lzycompute(QueryPlan.scala:49)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.outputSet(QueryPlan.scala:49)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun$apply$8.applyOrElse(Optimizer.scala:747)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun$apply$8.applyOrElse(Optimizer.scala:695)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:171)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:169)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:321)
>   at 
> 

[jira] [Resolved] (SPARK-36673) Incorrect Unions of struct with mismatched field name case

2021-09-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36673.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 34025
[https://github.com/apache/spark/pull/34025]

> Incorrect Unions of struct with mismatched field name case
> --
>
> Key: SPARK-36673
> URL: https://issues.apache.org/jira/browse/SPARK-36673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Shardul Mahadik
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> If a nested field has different casing on the two sides of a union, the 
> resultant schema of the union will contain both fields in its schema.
> {code:java}
> scala> val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS 
> INNER")))
> df1: org.apache.spark.sql.DataFrame = [id: bigint, nested: struct<INNER: bigint>]
> val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))
> df2: org.apache.spark.sql.DataFrame = [id: bigint, nested: struct<inner: bigint>]
> scala> df1.union(df2).printSchema
> root
>  |-- id: long (nullable = false)
>  |-- nested: struct (nullable = false)
>  ||-- INNER: long (nullable = false)
>  ||-- inner: long (nullable = false)
>  {code}
> This seems like a bug. I would expect that Spark SQL would either just union 
> by index or, if the user has requested {{unionByName}}, match fields 
> case-insensitively when {{spark.sql.caseSensitive}} is {{false}}.
> However, the output data has only one nested column:
> {code:java}
> scala> df1.union(df2).show()
> +---+--+
> | id|nested|
> +---+--+
> |  0|   {0}|
> |  1|   {5}|
> |  0|   {0}|
> |  1|   {5}|
> +---+--+
> {code}
> Trying to project fields of {{nested}} throws an error:
> {code:java}
> scala> df1.union(df2).select("nested.*").show()
> java.lang.ArrayIndexOutOfBoundsException: 1
>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:414)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.dataType(complexTypeExtractors.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:192)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:63)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.output(basicLogicalOperators.scala:63)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Union.$anonfun$output$3(basicLogicalOperators.scala:260)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:260)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.outputSet$lzycompute(QueryPlan.scala:49)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.outputSet(QueryPlan.scala:49)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun$apply$8.applyOrElse(Optimizer.scala:747)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun$apply$8.applyOrElse(Optimizer.scala:695)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:171)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:169)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>   at 
> 

[jira] [Resolved] (SPARK-36718) only collapse projects if we don't duplicate expensive expressions

2021-09-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36718.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33958
[https://github.com/apache/spark/pull/33958]

> only collapse projects if we don't duplicate expensive expressions
> --
>
> Key: SPARK-36718
> URL: https://issues.apache.org/jira/browse/SPARK-36718
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36718) only collapse projects if we don't duplicate expensive expressions

2021-09-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36718:
---

Assignee: Wenchen Fan

> only collapse projects if we don't duplicate expensive expressions
> --
>
> Key: SPARK-36718
> URL: https://issues.apache.org/jira/browse/SPARK-36718
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36764) Fix race-condition on "ensure continuous stream is being used" in KafkaContinuousTest

2021-09-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36764.
-
Fix Version/s: 3.2.0
 Assignee: Jungtaek Lim
   Resolution: Fixed

> Fix race-condition on "ensure continuous stream is being used" in 
> KafkaContinuousTest
> -
>
> Key: SPARK-36764
> URL: https://issues.apache.org/jira/browse/SPARK-36764
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.8, 3.0.3, 3.1.2, 3.2.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.2.0
>
>
> The test “ensure continuous stream is being used“ in KafkaContinuousTest 
> quickly checks the actual type of the execution and then stops the query. 
> Stopping a streaming query in continuous mode is done by interrupting the 
> query execution thread and joining it indefinitely.
> In parallel, the just-started streaming query is generating its execution 
> plan, including running the optimizer. Some parts of SessionState can be 
> built at that time, as they are defined as lazy. The problem is that some of 
> them seem to be able to “swallow” the InterruptedException and let the thread 
> keep running.
> As a result, the query cannot tell that a stop has been requested, so it 
> never stops.
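> A generic illustration (not code from Spark) of how a broad catch during lazy 
> initialization can swallow the interrupt described above:
> {code:scala}
> // The thread is interrupted while the lazy field initializes, but the
> // catch-all handler drops the InterruptedException, so the thread keeps
> // running and the joining (stopping) thread waits forever.
> lazy val expensivePart: AnyRef =
>   try {
>     Thread.sleep(10000) // stands in for building part of SessionState
>     new Object
>   } catch {
>     case _: Exception => // swallows InterruptedException; interrupt status is lost
>       new Object
>   }
> {code}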



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36741) array_distinct should not return duplicated NaN

2021-09-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36741.
-
Fix Version/s: 3.1.3
   3.2.0
   3.0.4
   Resolution: Fixed

Issue resolved by pull request 33993
[https://github.com/apache/spark/pull/33993]

> array_distinct should not return duplicated NaN
> ---
>
> Key: SPARK-36741
> URL: https://issues.apache.org/jira/browse/SPARK-36741
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.0.4, 3.2.0, 3.1.3
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36741) array_distinct should not return duplicated NaN

2021-09-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36741:
---

Assignee: angerszhu

> array_distinct should not return duplicated NaN
> ---
>
> Key: SPARK-36741
> URL: https://issues.apache.org/jira/browse/SPARK-36741
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics

2021-09-17 Thread Manu Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416656#comment-17416656
 ] 

Manu Zhang commented on SPARK-31646:


[~yzhangal],

Please check this comment  
[https://github.com/apache/spark/pull/28416#discussion_r418357988] for more 
background.

The counter reverted in this PR was simply never used; the PR just removed some 
dead code.

I didn't mean to use registeredConnections for anything different. It's 
eventually registered into ShuffleMetrics here.

[https://github.com/apache/spark/blob/master/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java#L248]
{code:java}
  blockHandler.getAllMetrics().getMetrics().put("numRegisteredConnections", 
 shuffleServer.getRegisteredConnections()); {code}
 

As I understand it, registeredConnections (and IdleConnections) is monitored at 
channel level (TransportChannelHandler) while activeConnections 
(blockTransferRateBytes, etc) at RPC level (ExternalShuffleBlockHandler). 
Hence, these metrics are kept in two places. 

You may register your backloggedConnections in ShuffleMetrics and update it 
with "registeredConnections - activeConnections" in ShuffleMetrics#getMetrics.
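For example, a rough sketch with the Dropwizard metrics API (the wiring below is 
illustrative only, not the actual Spark code):
{code:scala}
import com.codahale.metrics.{Counter, Gauge}

// Stand-ins for the counters the shuffle service already keeps.
val registeredConnections = new Counter()
val activeConnections = new Counter()

// A gauge computed on read, i.e. "registeredConnections - activeConnections".
val backloggedConnections: Gauge[Long] = new Gauge[Long] {
  override def getValue: Long =
    registeredConnections.getCount - activeConnections.getCount
}
{code}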

 

Your understanding of executors registering with Shuffle Service is correct but 
I don't see how it's related to your question.

> Remove unused registeredConnections counter from ShuffleMetrics
> ---
>
> Key: SPARK-31646
> URL: https://issues.apache.org/jira/browse/SPARK-31646
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics

2021-09-17 Thread Manu Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416656#comment-17416656
 ] 

Manu Zhang edited comment on SPARK-31646 at 9/17/21, 12:40 PM:
---

[~yzhangal],

Please check this comment  
[https://github.com/apache/spark/pull/28416#discussion_r418357988] for more 
background.

The counter reverted in this PR was simply never used; the PR just removed some 
dead code.

I didn't mean to use registeredConnections for anything different. It's 
eventually registered into ShuffleMetrics here.

[https://github.com/apache/spark/blob/master/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java#L248]
{code:java}
  blockHandler.getAllMetrics().getMetrics().put("numRegisteredConnections", 
 shuffleServer.getRegisteredConnections()); {code}
 

As I understand it, registeredConnections (and IdleConnections) is monitored at 
channel level (TransportChannelHandler) while activeConnections 
(blockTransferRateBytes, etc) at RPC level (ExternalShuffleBlockHandler). 
Hence, these metrics are kept in two places. 

You may register your backloggedConnections in ShuffleMetrics and update it 
with "registeredConnections - activeConnections" in ShuffleMetrics#getMetrics.

 

Your understanding of executors registering with Shuffle Service is correct but 
I don't see how it's related to your question.


was (Author: mauzhang):
[~yzhangal],

Please check this comment  
[https://github.com/apache/spark/pull/28416#discussion_r418357988] for more 
background.

The counter reverted in this PR was just never used, or this PR was simply to 
remove some dead codes.

I didn't meant to use registeredConnections for anything different. It's 
eventually registered into ShuffleMetrics here.

[https://github.com/apache/spark/blob/master/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java#L248]
{code:java}
  blockHandler.getAllMetrics().getMetrics().put("numRegisteredConnections", 
 shuffleServer.getRegisteredConnections()); {code}
 

As I understand it, registeredConnections (and IdleConnections) is monitored at 
channel level (TransportChannelHandler) while activeConnections 
(blockTransferRateBytes, etc) at RPC level (ExternalShuffleBlockHandler). 
Hence, these metrics are kept in two places. 

You may register your backloggedConnections in ShuffleMetrics and update it 
with "registeredConenctions - activeConnections" in 

ShuffleMetrics#getMetrics.

 

Your understanding of executors registering with Shuffle Service is correct but 
I don't see how it's related to your question.

> Remove unused registeredConnections counter from ShuffleMetrics
> ---
>
> Key: SPARK-31646
> URL: https://issues.apache.org/jira/browse/SPARK-31646
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36765) Spark Support for MS Sql JDBC connector with Kerberos/Keytab

2021-09-17 Thread Jakub Pawlowski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416635#comment-17416635
 ] 

Jakub Pawlowski commented on SPARK-36765:
-

As per the documentation on the JDBC driver, the sqljdbc_auth lib should not be 
needed and authentication should happen using pure Java libraries. That library 
was only needed for older versions of the driver.

[https://docs.microsoft.com/en-us/sql/connect/jdbc/using-kerberos-integrated-authentication-to-connect-to-sql-server?view=sql-server-ver15]

I could make it work from vanilla Java code, but Spark creates the JAAS 
configuration programmatically, so maybe that's where something gets broken?

> Spark Support for MS Sql JDBC connector with Kerberos/Keytab
> 
>
> Key: SPARK-36765
> URL: https://issues.apache.org/jira/browse/SPARK-36765
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
> Environment: Unix Redhat Environment
>Reporter: Dilip Thallam Sridhar
>Priority: Major
> Fix For: 3.1.2
>
>
> Hi Team,
>  
> We are using Spark 3.0.2 to connect to MS SqlServer with the following 
> instructions (also tried with Spark 3.1.2):
>  
>  1) download mssql-jdbc-9.4.0.jre8.jar
>  2) Generated Keytab using kinit
>  3) Validate Keytab using klist
>  4) Run the spark job with jdbc_library, principal and keytabs passed
> .config("spark.driver.extraClassPath", spark_jar_lib) \
> .config("spark.executor.extraClassPath", spark_jar_lib) \
>  5) connection_url = 
> "jdbc:sqlserver://{}:{};databaseName={};integratedSecurity=true;authenticationSchema=JavaKerberos"\
>  .format(jdbc_host_name, jdbc_port, jdbc_database_name)
> Note: without integratedSecurity=true;authenticationSchema=JavaKerberos it 
> looks for the usual username/password option to connect
> 6) passing the following options during spark read.
>  .option("principal", database_principal) \
>  .option("files", database_keytab) \
>  .option("keytab", database_keytab) \
>   
>  tried with files and keytab, just files, and with all above 3 parameters
>   
>  We are unable to connect to SqlServer from Spark and are getting the error 
> shown below. 
>   
>  A) Wanted to know if anybody has successfully connected Spark to SqlServer? 
> (as I see the previous Jiras have been closed)
>  https://issues.apache.org/jira/browse/SPARK-12312
>  https://issues.apache.org/jira/browse/SPARK-31337
>   
>  B) If yes, could you let us know if there are any additional configs needed 
> for Spark to connect to SqlServer please?
>  Appreciate if we can get inputs to resolve this error.
>   
>   
>  Full Stack Trace
> {code}
> Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: This driver is 
> not configured for integrated authentication. at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:1352)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.sendLogon(SQLServerConnection.java:2329)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.logon(SQLServerConnection.java:1905)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.access$000(SQLServerConnection.java:41)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection$LogonCommand.doExecute(SQLServerConnection.java:1893)
>  at 
> com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:4575) 
> at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1400)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:1045)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:817)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.connect(SQLServerConnection.java:700)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerDriver.connect(SQLServerDriver.java:842)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.connection.SecureConnectionProvider.getConnection(SecureConnectionProvider.scala:44)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.connection.MSSQLConnectionProvider.org$apache$spark$sql$execution$datasources$jdbc$connection$MSSQLConnectionProvider$$super$getConnection(MSSQLConnectionProvider.scala:69)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.connection.MSSQLConnectionProvider$$anon$1.run(MSSQLConnectionProvider.scala:69)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.connection.MSSQLConnectionProvider$$anon$1.run(MSSQLConnectionProvider.scala:67)
>  at 

[jira] [Assigned] (SPARK-36778) Support ILIKE API on Scala(dataframe)

2021-09-17 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-36778:


Assignee: Leona Yoda

> Support ILIKE API on Scala(dataframe)
> -
>
> Key: SPARK-36778
> URL: https://issues.apache.org/jira/browse/SPARK-36778
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Leona Yoda
>Assignee: Leona Yoda
>Priority: Major
>
> Support a Scala (DataFrame) API for ILIKE (case-insensitive LIKE)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36778) Support ILIKE API on Scala(dataframe)

2021-09-17 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-36778.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34027
[https://github.com/apache/spark/pull/34027]

> Support ILIKE API on Scala(dataframe)
> -
>
> Key: SPARK-36778
> URL: https://issues.apache.org/jira/browse/SPARK-36778
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Leona Yoda
>Assignee: Leona Yoda
>Priority: Major
> Fix For: 3.3.0
>
>
> Support a Scala (DataFrame) API for ILIKE (case-insensitive LIKE)
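> A sketch of how the DataFrame-side API might look (assuming it lands as a 
> Column method named {{ilike}}, mirroring the SQL keyword):
> {code:scala}
> import org.apache.spark.sql.functions.col
> 
> val df = spark.range(1).selectExpr("'Apache SPARK' AS name")
> // Case-insensitive pattern match via the DataFrame API, mirroring SQL's ILIKE.
> df.filter(col("name").ilike("%spark%")).show()
> {code}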



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36791) this is a spelling mistakes in running-on-yarn.md file where JHS_POST should be JHS_HOST

2021-09-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36791:


Assignee: Apache Spark

> this is a spelling mistakes in running-on-yarn.md file where  JHS_POST should 
> be JHS_HOST
> -
>
> Key: SPARK-36791
> URL: https://issues.apache.org/jira/browse/SPARK-36791
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2, 3.2.0
>Reporter: qingbo jiao
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.1.2, 3.2.0
>
> Attachments: error_message.png
>
>
> {code:java}
> NOTE: you need to replace  and  with actual value
> the JHS_POST should be JHS_HOST
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36791) this is a spelling mistakes in running-on-yarn.md file where JHS_POST should be JHS_HOST

2021-09-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416612#comment-17416612
 ] 

Apache Spark commented on SPARK-36791:
--

User 'jiaoqingbo' has created a pull request for this issue:
https://github.com/apache/spark/pull/34031

> this is a spelling mistakes in running-on-yarn.md file where  JHS_POST should 
> be JHS_HOST
> -
>
> Key: SPARK-36791
> URL: https://issues.apache.org/jira/browse/SPARK-36791
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2, 3.2.0
>Reporter: qingbo jiao
>Priority: Minor
> Fix For: 3.1.2, 3.2.0
>
> Attachments: error_message.png
>
>
> {code:java}
> NOTE: you need to replace  and  with actual value
> the JHS_POST should be JHS_HOST
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36791) this is a spelling mistakes in running-on-yarn.md file where JHS_POST should be JHS_HOST

2021-09-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36791:


Assignee: (was: Apache Spark)

> this is a spelling mistakes in running-on-yarn.md file where  JHS_POST should 
> be JHS_HOST
> -
>
> Key: SPARK-36791
> URL: https://issues.apache.org/jira/browse/SPARK-36791
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2, 3.2.0
>Reporter: qingbo jiao
>Priority: Minor
> Fix For: 3.1.2, 3.2.0
>
> Attachments: error_message.png
>
>
> {code:java}
> NOTE: you need to replace  and  with actual value
> the JHS_POST should be JHS_HOST
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36791) this is a spelling mistakes in running-on-yarn.md file where JHS_POST should be JHS_HOST

2021-09-17 Thread qingbo jiao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

qingbo jiao updated SPARK-36791:

Target Version/s: 3.1.2, 3.2.0  (was: 3.1.2)

> this is a spelling mistakes in running-on-yarn.md file where  JHS_POST should 
> be JHS_HOST
> -
>
> Key: SPARK-36791
> URL: https://issues.apache.org/jira/browse/SPARK-36791
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2, 3.2.0
>Reporter: qingbo jiao
>Priority: Minor
> Fix For: 3.1.2, 3.2.0
>
> Attachments: error_message.png
>
>
> {code:java}
> NOTE: you need to replace  and  with actual value
> the JHS_POST should be JHS_HOST
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36791) this is a spelling mistakes in running-on-yarn.md file where JHS_POST should be JHS_HOST

2021-09-17 Thread qingbo jiao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

qingbo jiao updated SPARK-36791:

Description: 
{code:java}
NOTE: you need to replace  and  with actual value

the JHS_POST should be JHS_HOST

{code}

  was:
{code:java}
NOTE: you need to replace  and  with actual value

the JHS_POST should be JHS_HOST



{code}


> this is a spelling mistakes in running-on-yarn.md file where  JHS_POST should 
> be JHS_HOST
> -
>
> Key: SPARK-36791
> URL: https://issues.apache.org/jira/browse/SPARK-36791
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2, 3.2.0
>Reporter: qingbo jiao
>Priority: Minor
> Fix For: 3.1.2
>
> Attachments: error_message.png
>
>
> {code:java}
> NOTE: you need to replace  and  with actual value
> the JHS_POST should be JHS_HOST
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36791) this is a spelling mistakes in running-on-yarn.md file where JHS_POST should be JHS_HOST

2021-09-17 Thread qingbo jiao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

qingbo jiao updated SPARK-36791:

Fix Version/s: 3.2.0

> this is a spelling mistakes in running-on-yarn.md file where  JHS_POST should 
> be JHS_HOST
> -
>
> Key: SPARK-36791
> URL: https://issues.apache.org/jira/browse/SPARK-36791
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2, 3.2.0
>Reporter: qingbo jiao
>Priority: Minor
> Fix For: 3.1.2, 3.2.0
>
> Attachments: error_message.png
>
>
> {code:java}
> NOTE: you need to replace  and  with actual value
> the JHS_POST should be JHS_HOST
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36791) this is a spelling mistakes in running-on-yarn.md file where JHS_POST should be JHS_HOST

2021-09-17 Thread qingbo jiao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

qingbo jiao updated SPARK-36791:

Description: 
{code:java}
NOTE: you need to replace  and  with actual value

the JHS_POST should be JHS_HOST



{code}

  was:
{code:java}
// code placeholder

NOTE: you need to replace  and  with actual value

the JHS_POST should be JHS_HOST



{code}


> this is a spelling mistakes in running-on-yarn.md file where  JHS_POST should 
> be JHS_HOST
> -
>
> Key: SPARK-36791
> URL: https://issues.apache.org/jira/browse/SPARK-36791
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2, 3.2.0
>Reporter: qingbo jiao
>Priority: Minor
> Fix For: 3.1.2
>
> Attachments: error_message.png
>
>
> {code:java}
> NOTE: you need to replace  and  with actual value
> the JHS_POST should be JHS_HOST
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36791) this is a spelling mistakes in running-on-yarn.md file where JHS_POST should be JHS_HOST

2021-09-17 Thread qingbo jiao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

qingbo jiao updated SPARK-36791:

Description: 
{code:java}
// code placeholder

NOTE: you need to replace  and  with actual value

the JHS_POST should be JHS_HOST



{code}

> this is a spelling mistakes in running-on-yarn.md file where  JHS_POST should 
> be JHS_HOST
> -
>
> Key: SPARK-36791
> URL: https://issues.apache.org/jira/browse/SPARK-36791
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2, 3.2.0
>Reporter: qingbo jiao
>Priority: Minor
> Fix For: 3.1.2
>
> Attachments: error_message.png
>
>
> {code:java}
> // code placeholder
> NOTE: you need to replace  and  with actual value
> the JHS_POST should be JHS_HOST
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36791) this is a spelling mistakes in running-on-yarn.md file where JHS_POST should be JHS_HOST

2021-09-17 Thread qingbo jiao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

qingbo jiao updated SPARK-36791:

Attachment: error_message.png

> this is a spelling mistakes in running-on-yarn.md file where  JHS_POST should 
> be JHS_HOST
> -
>
> Key: SPARK-36791
> URL: https://issues.apache.org/jira/browse/SPARK-36791
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2, 3.2.0
>Reporter: qingbo jiao
>Priority: Minor
> Fix For: 3.1.2
>
> Attachments: error_message.png
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36791) this is a spelling mistakes in running-on-yarn.md file where JHS_POST should be JHS_HOST

2021-09-17 Thread qingbo jiao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

qingbo jiao updated SPARK-36791:

Attachment: 微信截图_20210917181324.png

> this is a spelling mistakes in running-on-yarn.md file where  JHS_POST should 
> be JHS_HOST
> -
>
> Key: SPARK-36791
> URL: https://issues.apache.org/jira/browse/SPARK-36791
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2, 3.2.0
>Reporter: qingbo jiao
>Priority: Minor
> Fix For: 3.1.2
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36791) this is a spelling mistakes in running-on-yarn.md file where JHS_POST should be JHS_HOST

2021-09-17 Thread qingbo jiao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

qingbo jiao updated SPARK-36791:

Attachment: (was: 微信截图_20210917181324.png)

> this is a spelling mistakes in running-on-yarn.md file where  JHS_POST should 
> be JHS_HOST
> -
>
> Key: SPARK-36791
> URL: https://issues.apache.org/jira/browse/SPARK-36791
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.1.2, 3.2.0
>Reporter: qingbo jiao
>Priority: Minor
> Fix For: 3.1.2
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36791) this is a spelling mistakes in running-on-yarn.md file where JHS_POST should be JHS_HOST

2021-09-17 Thread qingbo jiao (Jira)
qingbo jiao created SPARK-36791:
---

 Summary: this is a spelling mistakes in running-on-yarn.md file 
where  JHS_POST should be JHS_HOST
 Key: SPARK-36791
 URL: https://issues.apache.org/jira/browse/SPARK-36791
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 3.1.2, 3.2.0
Reporter: qingbo jiao
 Fix For: 3.1.2






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36727) Support sql overwrite a path that is also being read from when partitionOverwriteMode is dynamic

2021-09-17 Thread Tongwei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tongwei updated SPARK-36727:

Priority: Major  (was: Minor)

> Support sql overwrite a path that is also being read from when 
> partitionOverwriteMode is dynamic
> 
>
> Key: SPARK-36727
> URL: https://issues.apache.org/jira/browse/SPARK-36727
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Tongwei
>Priority: Major
>
> {code:java}
> -- non-partitioned table overwrite
> CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET;
> INSERT OVERWRITE TABLE tbl SELECT 0, 1;
> INSERT OVERWRITE TABLE tbl SELECT * FROM tbl;
> -- partitioned table static overwrite
> CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET PARTITIONED BY (pt1 INT);
> INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT 0 AS col1, 1 AS col2;
> INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT col1, col2 FROM tbl WHERE pt1=2021;
> {code}
> When we run the queries above, the error "Cannot overwrite a path that is 
> also being read from" is thrown.
> We need to support this operation when spark.sql.sources.partitionOverwriteMode 
> is dynamic.
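> For context, a minimal sketch of how that mode is enabled for a session (the 
> conf name comes from the description; with current releases the self-overwrite 
> below still fails as described):
> {code:scala}
> // Enable dynamic partition overwrite, then retry the self-overwriting insert.
> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
> spark.sql(
>   "INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) " +
>   "SELECT col1, col2 FROM tbl WHERE pt1=2021")
> {code}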



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36789) use the correct constant type as the null value holder in array functions

2021-09-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36789.
-
Fix Version/s: 3.0.4
   3.1.3
   3.2.0
   Resolution: Fixed

> use the correct constant type as the null value holder in array functions
> -
>
> Key: SPARK-36789
> URL: https://issues.apache.org/jira/browse/SPARK-36789
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.2.0, 3.1.3, 3.0.4
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36765) Spark Support for MS Sql JDBC connector with Kerberos/Keytab

2021-09-17 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416519#comment-17416519
 ] 

Gabor Somogyi commented on SPARK-36765:
---

It was a long time ago when I did that, and AFAIR it took me almost a month to 
make it work, so it was definitely a horror task!
My knowledge is cloudy because it was not yesterday, but I remember something 
like this:

The exception generally indicates that the driver cannot find the appropriate 
sqljdbc_auth lib in the JVM library path. To correct the problem, one can use 
the java -D option to specify the "java.library.path" system property value. 
Worth mentioning that the full path must be set, otherwise it did not work.
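A minimal check one might run on the driver to confirm the setting (illustrative 
only; the directory below is a made-up example, and the property itself has to be 
supplied at JVM launch, e.g. through spark.driver.extraJavaOptions):
{code:scala}
// Verify that the directory holding the native sqljdbc_auth library is actually
// on java.library.path; the "not configured for integrated authentication" error
// is typical when it is not.
val expectedDir = "/opt/mssql-jdbc-auth" // hypothetical location of the native lib
val libPath = System.getProperty("java.library.path")
require(libPath.split(java.io.File.pathSeparator).contains(expectedDir),
  s"sqljdbc_auth directory not on java.library.path: $libPath")
{code}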

All in all, I faced at least 5-6 different issues which were extremely hard 
to address. Hope others need less time to solve them.


> Spark Support for MS Sql JDBC connector with Kerberos/Keytab
> 
>
> Key: SPARK-36765
> URL: https://issues.apache.org/jira/browse/SPARK-36765
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
> Environment: Unix Redhat Environment
>Reporter: Dilip Thallam Sridhar
>Priority: Major
> Fix For: 3.1.2
>
>
> Hi Team,
>  
> We are using Spark 3.0.2 to connect to MS SqlServer with the following 
> instructions (also tried with Spark 3.1.2):
>  
>  1) download mssql-jdbc-9.4.0.jre8.jar
>  2) Generated Keytab using kinit
>  3) Validate Keytab using klist
>  4) Run the spark job with jdbc_library, principal and keytabs passed
> .config("spark.driver.extraClassPath", spark_jar_lib) \
> .config("spark.executor.extraClassPath", spark_jar_lib) \
>  5) connection_url = 
> "jdbc:sqlserver://{}:{};databaseName={};integratedSecurity=true;authenticationSchema=JavaKerberos"\
>  .format(jdbc_host_name, jdbc_port, jdbc_database_name)
> Note: without integratedSecurity=true;authenticationSchema=JavaKerberos it 
> looks for the usual username/password option to connect
> 6) passing the following options during spark read.
>  .option("principal", database_principal) \
>  .option("files", database_keytab) \
>  .option("keytab", database_keytab) \
>   
>  tried with files and keytab, just files, and with all above 3 parameters
>   
>  We are unable to connect to SqlServer from Spark and are getting the error 
> shown below. 
>   
>  A) Wanted to know if anybody has successfully connected Spark to SqlServer? 
> (as I see the previous Jiras have been closed)
>  https://issues.apache.org/jira/browse/SPARK-12312
>  https://issues.apache.org/jira/browse/SPARK-31337
>   
>  B) If yes, could you let us know if there are any additional configs needed 
> for Spark to connect to SqlServer please?
>  Appreciate if we can get inputs to resolve this error.
>   
>   
>  Full Stack Trace
> {code}
> Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: This driver is 
> not configured for integrated authentication. at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:1352)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.sendLogon(SQLServerConnection.java:2329)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.logon(SQLServerConnection.java:1905)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.access$000(SQLServerConnection.java:41)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection$LogonCommand.doExecute(SQLServerConnection.java:1893)
>  at 
> com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:4575) 
> at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1400)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:1045)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:817)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerConnection.connect(SQLServerConnection.java:700)
>  at 
> com.microsoft.sqlserver.jdbc.SQLServerDriver.connect(SQLServerDriver.java:842)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.connection.SecureConnectionProvider.getConnection(SecureConnectionProvider.scala:44)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.connection.MSSQLConnectionProvider.org$apache$spark$sql$execution$datasources$jdbc$connection$MSSQLConnectionProvider$$super$getConnection(MSSQLConnectionProvider.scala:69)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.connection.MSSQLConnectionProvider$$anon$1.run(MSSQLConnectionProvider.scala:69)
>  at 
> 

[jira] [Commented] (SPARK-36790) Update user-facing catalog to adapt CatalogPlugin

2021-09-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416490#comment-17416490
 ] 

Apache Spark commented on SPARK-36790:
--

User 'Peng-Lei' has created a pull request for this issue:
https://github.com/apache/spark/pull/34030

> Update user-facing catalog to adapt CatalogPlugin
> -
>
> Key: SPARK-36790
> URL: https://issues.apache.org/jira/browse/SPARK-36790
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Priority: Minor
> Fix For: 3.3.0
>
>
> Currently, SparkSession.catalog always returns a CatalogImpl backed by a 
> SessionCatalog, which is SparkSession.sessionState.catalog.
> {code:java}
> @transient lazy val catalog: Catalog = new CatalogImpl(self)
> {code}
> {code:java}
> private def sessionCatalog: SessionCatalog = sparkSession.sessionState.catalog
> {code}
> So actions can only be performed against the SessionCatalog; we cannot perform 
> actions against a user-defined CatalogPlugin.
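> For illustration, a small sketch of the gap (the plugin class name is 
> hypothetical; spark.sql.catalog.<catalog-name> is the standard way to register 
> a CatalogPlugin):
> {code:scala}
> // Register a user-defined catalog and create a namespace in it via SQL.
> spark.conf.set("spark.sql.catalog.mycat", "com.example.MyCatalogPlugin")
> spark.sql("CREATE NAMESPACE mycat.ns1")
> 
> // The user-facing catalog API only reflects the built-in session catalog,
> // so namespaces from mycat do not show up here.
> spark.catalog.listDatabases().show()
> {code}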



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36790) Update user-facing catalog to adapt CatalogPlugin

2021-09-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36790:


Assignee: Apache Spark

> Update user-facing catalog to adapt CatalogPlugin
> -
>
> Key: SPARK-36790
> URL: https://issues.apache.org/jira/browse/SPARK-36790
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.3.0
>
>
> Currently, SparkSession.catalog always returns a CatalogImpl backed by a 
> SessionCatalog, which is SparkSession.sessionState.catalog.
> {code:java}
> @transient lazy val catalog: Catalog = new CatalogImpl(self)
> {code}
> {code:java}
> private def sessionCatalog: SessionCatalog = sparkSession.sessionState.catalog
> {code}
> So actions can only be performed against the SessionCatalog; we cannot perform 
> actions against a user-defined CatalogPlugin.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36790) Update user-facing catalog to adapt CatalogPlugin

2021-09-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36790:


Assignee: (was: Apache Spark)

> Update user-facing catalog to adapt CatalogPlugin
> -
>
> Key: SPARK-36790
> URL: https://issues.apache.org/jira/browse/SPARK-36790
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Priority: Minor
> Fix For: 3.3.0
>
>
> Currently, SparkSession.catalog always returns a CatalogImpl backed by a 
> SessionCatalog, which is SparkSession.sessionState.catalog.
> {code:java}
> @transient lazy val catalog: Catalog = new CatalogImpl(self)
> {code}
> {code:java}
> private def sessionCatalog: SessionCatalog = sparkSession.sessionState.catalog
> {code}
> So actions can only be performed against the SessionCatalog; we cannot perform 
> actions against a user-defined CatalogPlugin.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32709) Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2)

2021-09-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32709.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33432
[https://github.com/apache/spark/pull/33432]

> Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2)
> --
>
> Key: SPARK-32709
> URL: https://issues.apache.org/jira/browse/SPARK-32709
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.3.0
>
> Attachments: 91275701_stage6_metrics.png
>
>
> The Hive ORC/Parquet write code path is the same as the data source v1 code path 
> (FileFormatWriter). This JIRA is to add support for writing Hive ORC/Parquet 
> bucketed tables with hivehash. The change is to customize `bucketIdExpression` to 
> use hivehash when the table is a Hive bucketed table and the Hive version is 
> 1.x.y or 2.x.y.
>  
> This will allow us to write Hive/Presto-compatible bucketed tables for Hive 1 
> and 2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32709) Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2)

2021-09-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32709:
---

Assignee: Cheng Su

> Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2)
> --
>
> Key: SPARK-32709
> URL: https://issues.apache.org/jira/browse/SPARK-32709
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Attachments: 91275701_stage6_metrics.png
>
>
> The Hive ORC/Parquet write code path is the same as the data source v1 code path 
> (FileFormatWriter). This JIRA is to add support for writing Hive ORC/Parquet 
> bucketed tables with hivehash. The change is to customize `bucketIdExpression` to 
> use hivehash when the table is a Hive bucketed table and the Hive version is 
> 1.x.y or 2.x.y.
>  
> This will allow us to write Hive/Presto-compatible bucketed tables for Hive 1 
> and 2.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36790) Update user-facing catalog to adapt CatalogPlugin

2021-09-17 Thread PengLei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

PengLei updated SPARK-36790:

Description: 
Currently, SparkSession.catalog always returns a CatalogImpl backed by a 
SessionCatalog, which is SparkSession.sessionState.catalog.
{code:java}
@transient lazy val catalog: Catalog = new CatalogImpl(self)
{code}
{code:java}
private def sessionCatalog: SessionCatalog = sparkSession.sessionState.catalog
{code}
So actions can only be performed against the SessionCatalog; we cannot perform 
actions against a user-defined CatalogPlugin.

> Update user-facing catalog to adapt CatalogPlugin
> -
>
> Key: SPARK-36790
> URL: https://issues.apache.org/jira/browse/SPARK-36790
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: PengLei
>Priority: Minor
> Fix For: 3.3.0
>
>
> Currently, SparkSession.catalog always returns a CatalogImpl backed by a 
> SessionCatalog, which is SparkSession.sessionState.catalog.
> {code:java}
> @transient lazy val catalog: Catalog = new CatalogImpl(self)
> {code}
> {code:java}
> private def sessionCatalog: SessionCatalog = sparkSession.sessionState.catalog
> {code}
> So actions can only be performed against the SessionCatalog; we cannot perform 
> actions against a user-defined CatalogPlugin.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36790) Update user-facing catalog to adapt CatalogPlugin

2021-09-17 Thread PengLei (Jira)
PengLei created SPARK-36790:
---

 Summary: Update user-facing catalog to adapt CatalogPlugin
 Key: SPARK-36790
 URL: https://issues.apache.org/jira/browse/SPARK-36790
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.3.0
Reporter: PengLei
 Fix For: 3.3.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36789) use the correct constant type as the null value holder in array functions

2021-09-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416476#comment-17416476
 ] 

Apache Spark commented on SPARK-36789:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/34029

> use the correct constant type as the null value holder in array functions
> -
>
> Key: SPARK-36789
> URL: https://issues.apache.org/jira/browse/SPARK-36789
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36789) use the correct constant type as the null value holder in array functions

2021-09-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36789:


Assignee: Wenchen Fan  (was: Apache Spark)

> use the correct constant type as the null value holder in array functions
> -
>
> Key: SPARK-36789
> URL: https://issues.apache.org/jira/browse/SPARK-36789
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36789) use the correct constant type as the null value holder in array functions

2021-09-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36789:


Assignee: Apache Spark  (was: Wenchen Fan)

> use the correct constant type as the null value holder in array functions
> -
>
> Key: SPARK-36789
> URL: https://issues.apache.org/jira/browse/SPARK-36789
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36789) use the correct constant type as the null value holder in array functions

2021-09-17 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-36789:
---

 Summary: use the correct constant type as the null value holder in 
array functions
 Key: SPARK-36789
 URL: https://issues.apache.org/jira/browse/SPARK-36789
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org