[jira] [Assigned] (SPARK-42829) Added Identifier to the cached RDD operator on the Stages page

2023-03-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42829:


Assignee: Apache Spark

> Added Identifier to the cached RDD operator on the Stages page 
> ---
>
> Key: SPARK-42829
> URL: https://issues.apache.org/jira/browse/SPARK-42829
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.3.2
>Reporter: Yian Liou
>Assignee: Apache Spark
>Priority: Major
> Attachments: Screen Shot 2023-03-20 at 3.55.40 PM.png
>
>
> On the Stages page in the Web UI, there is no way to tell which cached 
> RDD is being executed in a particular stage. This Jira aims to add a repeat 
> identifier to distinguish which cached RDD is being executed.
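A minimal PySpark repro of the ambiguity described above (an illustrative sketch, not taken from the ticket): two persisted datasets whose stages both read a cached RDD, with nothing on the Stages page to tell them apart.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cached-rdd-repro").getOrCreate()

# Two independent cached datasets.
a = spark.range(1_000_000).selectExpr("id", "id % 7 AS k").cache()
b = spark.range(1_000_000).selectExpr("id", "id % 13 AS k").cache()

# Materialize both. Each count() runs stages that scan a cached RDD, but
# the Stages page shows no identifier saying which cached RDD a stage reads.
a.count()
b.count()
{code}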



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42829) Added Identifier to the cached RDD operator on the Stages page

2023-03-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703039#comment-17703039
 ] 

Apache Spark commented on SPARK-42829:
--

User 'yliou' has created a pull request for this issue:
https://github.com/apache/spark/pull/40502

> Added Identifier to the cached RDD operator on the Stages page 
> ---
>
> Key: SPARK-42829
> URL: https://issues.apache.org/jira/browse/SPARK-42829
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.3.2
>Reporter: Yian Liou
>Priority: Major
> Attachments: Screen Shot 2023-03-20 at 3.55.40 PM.png
>
>
> On the Stages page in the Web UI, there is no way to tell which cached 
> RDD is being executed in a particular stage. This Jira aims to add a repeat 
> identifier to distinguish which cached RDD is being executed.






[jira] [Assigned] (SPARK-42829) Added Identifier to the cached RDD operator on the Stages page

2023-03-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42829:


Assignee: (was: Apache Spark)

> Added Identifier to the cached RDD operator on the Stages page 
> ---
>
> Key: SPARK-42829
> URL: https://issues.apache.org/jira/browse/SPARK-42829
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.3.2
>Reporter: Yian Liou
>Priority: Major
> Attachments: Screen Shot 2023-03-20 at 3.55.40 PM.png
>
>
> On the Stages page in the Web UI, there is no way to tell which cached 
> RDD is being executed in a particular stage. This Jira aims to add a repeat 
> identifier to distinguish which cached RDD is being executed.






[jira] [Commented] (SPARK-42829) Added Identifier to the cached RDD operator on the Stages page

2023-03-20 Thread Yian Liou (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703038#comment-17703038
 ] 

Yian Liou commented on SPARK-42829:
---

Opened a PR at [https://github.com/apache/spark/pull/40502] and included a 
screenshot there. [~gurwls223]

> Added Identifier to the cached RDD operator on the Stages page 
> ---
>
> Key: SPARK-42829
> URL: https://issues.apache.org/jira/browse/SPARK-42829
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.3.2
>Reporter: Yian Liou
>Priority: Major
> Attachments: Screen Shot 2023-03-20 at 3.55.40 PM.png
>
>
> On the Stages page in the Web UI, there is no way to tell which cached 
> RDD is being executed in a particular stage. This Jira aims to add a repeat 
> identifier to distinguish which cached RDD is being executed.






[jira] [Resolved] (SPARK-42864) Review and fix issues in MLlib API docs

2023-03-20 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng resolved SPARK-42864.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40500
[https://github.com/apache/spark/pull/40500]

> Review and fix issues in MLlib API docs
> ---
>
> Key: SPARK-42864
> URL: https://issues.apache.org/jira/browse/SPARK-42864
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Commented] (SPARK-42864) Review and fix issues in MLlib API docs

2023-03-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703026#comment-17703026
 ] 

Apache Spark commented on SPARK-42864:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40501

> Review and fix issues in MLlib API docs
> ---
>
> Key: SPARK-42864
> URL: https://issues.apache.org/jira/browse/SPARK-42864
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Ruifeng Zheng
>Priority: Major
>







[jira] [Commented] (SPARK-42864) Review and fix issues in MLlib API docs

2023-03-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703017#comment-17703017
 ] 

Apache Spark commented on SPARK-42864:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40500

> Review and fix issues in MLlib API docs
> ---
>
> Key: SPARK-42864
> URL: https://issues.apache.org/jira/browse/SPARK-42864
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Ruifeng Zheng
>Priority: Major
>







[jira] [Commented] (SPARK-42864) Review and fix issues in MLlib API docs

2023-03-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17703018#comment-17703018
 ] 

Apache Spark commented on SPARK-42864:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40500

> Review and fix issues in MLlib API docs
> ---
>
> Key: SPARK-42864
> URL: https://issues.apache.org/jira/browse/SPARK-42864
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Ruifeng Zheng
>Priority: Major
>







[jira] [Assigned] (SPARK-42864) Review and fix issues in MLlib API docs

2023-03-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42864:


Assignee: Apache Spark  (was: Ruifeng Zheng)

> Review and fix issues in MLlib API docs
> ---
>
> Key: SPARK-42864
> URL: https://issues.apache.org/jira/browse/SPARK-42864
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-42864) Review and fix issues in MLlib API docs

2023-03-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42864:


Assignee: Ruifeng Zheng  (was: Apache Spark)

> Review and fix issues in MLlib API docs
> ---
>
> Key: SPARK-42864
> URL: https://issues.apache.org/jira/browse/SPARK-42864
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Ruifeng Zheng
>Priority: Major
>







[jira] [Assigned] (SPARK-42864) Review and fix issues in MLlib API docs

2023-03-20 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-42864:
-

Assignee: Ruifeng Zheng

> Review and fix issues in MLlib API docs
> ---
>
> Key: SPARK-42864
> URL: https://issues.apache.org/jira/browse/SPARK-42864
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Ruifeng Zheng
>Priority: Major
>







[jira] [Resolved] (SPARK-42875) Fix toPandas to handle timezone and map types properly.

2023-03-20 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-42875.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40497
[https://github.com/apache/spark/pull/40497]

> Fix toPandas to handle timezone and map types properly.
> ---
>
> Key: SPARK-42875
> URL: https://issues.apache.org/jira/browse/SPARK-42875
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-42875) Fix toPandas to handle timezone and map types properly.

2023-03-20 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-42875:
-

Assignee: Takuya Ueshin

> Fix toPandas to handle timezone and map types properly.
> ---
>
> Key: SPARK-42875
> URL: https://issues.apache.org/jira/browse/SPARK-42875
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>







[jira] [Created] (SPARK-42877) Implement DataFrame.foreach

2023-03-20 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-42877:


 Summary: Implement DataFrame.foreach
 Key: SPARK-42877
 URL: https://issues.apache.org/jira/browse/SPARK-42877
 Project: Spark
  Issue Type: Improvement
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Xinrong Meng


Maybe we can leverage UDFs to implement it, e.g. 
`df.select(udf(*df.schema)).count()`.
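A rough sketch of that idea (a hypothetical helper; the names and the dummy return value are illustrative, not the final implementation):

{code:python}
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

def foreach_via_udf(df, f):
    """Apply f to every row for its side effects, using a UDF plus count()."""
    @F.udf(returnType=IntegerType())
    def apply_f(*cols):
        f(cols)      # run the user's function on the row values (executor-side)
        return 0     # dummy result; only the side effect matters
    # Select the UDF over all columns and force evaluation with count().
    df.select(apply_f(*df.columns)).count()

# Usage: foreach_via_udf(df, lambda row: print(row))
# Note that f runs on the executors, so side effects like print() land in
# executor logs, mirroring classic RDD.foreach semantics.
{code}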






[jira] [Assigned] (SPARK-42876) DataType's physicalDataType should be private[sql]

2023-03-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42876:


Assignee: Rui Wang  (was: Apache Spark)

> DataType's physicalDataType should be private[sql]
> --
>
> Key: SPARK-42876
> URL: https://issues.apache.org/jira/browse/SPARK-42876
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-42876) DataType's physicalDataType should be private[sql]

2023-03-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42876:


Assignee: Apache Spark  (was: Rui Wang)

> DataType's physicalDataType should be private[sql]
> --
>
> Key: SPARK-42876
> URL: https://issues.apache.org/jira/browse/SPARK-42876
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-42876) DataType's physicalDataType should be private[sql]

2023-03-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702978#comment-17702978
 ] 

Apache Spark commented on SPARK-42876:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/40499

> DataType's physicalDataType should be private[sql]
> --
>
> Key: SPARK-42876
> URL: https://issues.apache.org/jira/browse/SPARK-42876
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>







[jira] [Created] (SPARK-42876) DataType's physicalDataType should be private[sql]

2023-03-20 Thread Rui Wang (Jira)
Rui Wang created SPARK-42876:


 Summary: DataType's physicalDataType should be private[sql]
 Key: SPARK-42876
 URL: https://issues.apache.org/jira/browse/SPARK-42876
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Rui Wang
Assignee: Rui Wang









[jira] [Updated] (SPARK-42829) Added Identifier to the cached RDD operator on the Stages page

2023-03-20 Thread Yian Liou (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yian Liou updated SPARK-42829:
--
Attachment: Screen Shot 2023-03-20 at 3.55.40 PM.png

> Added Identifier to the cached RDD operator on the Stages page 
> ---
>
> Key: SPARK-42829
> URL: https://issues.apache.org/jira/browse/SPARK-42829
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.3.2
>Reporter: Yian Liou
>Priority: Major
> Attachments: Screen Shot 2023-03-20 at 3.55.40 PM.png
>
>
> On the Stages page in the Web UI, there is no way to tell which cached 
> RDD is being executed in a particular stage. This Jira aims to add a repeat 
> identifier to distinguish which cached RDD is being executed.






[jira] [Commented] (SPARK-42411) Better support for Istio service mesh while running Spark on Kubernetes

2023-03-20 Thread Puneet (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702929#comment-17702929
 ] 

Puneet commented on SPARK-42411:


Should be able to create a PR, hopefully by next week.

> Better support for Istio service mesh while running Spark on Kubernetes
> ---
>
> Key: SPARK-42411
> URL: https://issues.apache.org/jira/browse/SPARK-42411
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.2.3
>Reporter: Puneet
>Priority: Major
>
> h3. Support for Strict MTLS
> With strict MTLS peer authentication, Istio requires each pod to be associated 
> with a service identity, as this allows listeners to use the correct cert and 
> chain. Without a service identity, communication goes through the passthrough 
> cluster, which is not permitted in strict mode. The community is still 
> investigating communication through IPs with strict MTLS 
> [https://github.com/istio/istio/issues/37431#issuecomment-1412831780]. Today 
> the Spark backend creates a service record for the driver, but executor pods 
> register with the driver using their pod IPs. In this model, therefore, the 
> TLS handshake would fail between driver and executor, and also between 
> executors. As part of this Jira we want to add service records for the 
> executor pods as well. This can be achieved by adding an 
> ExecutorServiceFeatureStep similar to the existing DriverServiceFeatureStep.
> h3. Allowing binding to all IPs
> Before Istio 1.10, the istio-proxy sidecar forwarded outside traffic to the 
> pod's localhost, so an application container that binds only to the pod IP 
> would not receive it. This was addressed in 1.10 
> [https://istio.io/latest/blog/2021/upcoming-networking-changes]. However, the 
> old behavior is still available by disabling the feature flag 
> PILOT_ENABLE_INBOUND_PASSTHROUGH, and the request to remove that flag has had 
> some pushback [https://github.com/istio/istio/issues/37642]. In the current 
> implementation, the Spark K8s backend does not allow passing a bind address 
> for the driver 
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala#L35];
>  as part of this Jira we want to allow passing a bind address even in 
> Kubernetes mode, so long as the bind address is 0.0.0.0. This lets users 
> choose the behavior depending on the state of 
> PILOT_ENABLE_INBOUND_PASSTHROUGH in their Istio cluster.
> h3. Better support for istio-proxy sidecar lifecycle management
> In an Istio-enabled cluster, istio-proxy sidecars are auto-injected into 
> driver/executor pods. If the application is ephemeral, the driver and 
> executor containers exit while the istio-proxy container continues to run, 
> which leaves the driver/executor pods in a NotReady state. As part of this 
> Jira we want the ability to run a post-stop cleanup after the driver/executor 
> container completes. Similarly, we want to add support for a pre-startup 
> script, which can ensure, for example, that the istio-proxy sidecar is up 
> before the executor/driver container starts.
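As a hedged illustration of the bind-address piece only (assuming the change lands as described, i.e. `spark.driver.bindAddress` becomes accepted in Kubernetes mode when set to 0.0.0.0):

{code:python}
from pyspark.sql import SparkSession

# Sketch only: today the K8s backend rejects a driver bind address; after the
# proposed change, 0.0.0.0 would be allowed, so the driver binds to all
# interfaces and the istio-proxy sidecar can forward inbound traffic to it.
spark = (
    SparkSession.builder
    .config("spark.driver.bindAddress", "0.0.0.0")
    .getOrCreate()
)
{code}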






[jira] [Assigned] (SPARK-42875) Fix toPandas to handle timezone and map types properly.

2023-03-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42875:


Assignee: Apache Spark

> Fix toPandas to handle timezone and map types properly.
> ---
>
> Key: SPARK-42875
> URL: https://issues.apache.org/jira/browse/SPARK-42875
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-42875) Fix toPandas to handle timezone and map types properly.

2023-03-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702922#comment-17702922
 ] 

Apache Spark commented on SPARK-42875:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/40497

> Fix toPandas to handle timezone and map types properly.
> ---
>
> Key: SPARK-42875
> URL: https://issues.apache.org/jira/browse/SPARK-42875
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>







[jira] [Commented] (SPARK-42875) Fix toPandas to handle timezone and map types properly.

2023-03-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702921#comment-17702921
 ] 

Apache Spark commented on SPARK-42875:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/40497

> Fix toPandas to handle timezone and map types properly.
> ---
>
> Key: SPARK-42875
> URL: https://issues.apache.org/jira/browse/SPARK-42875
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>







[jira] [Assigned] (SPARK-42875) Fix toPandas to handle timezone and map types properly.

2023-03-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42875:


Assignee: (was: Apache Spark)

> Fix toPandas to handle timezone and map types properly.
> ---
>
> Key: SPARK-42875
> URL: https://issues.apache.org/jira/browse/SPARK-42875
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Priority: Major
>







[jira] [Resolved] (SPARK-36180) Support TimestampNTZ type in Hive

2023-03-20 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-36180.

Resolution: Won't Fix

> Support TimestampNTZ type in Hive
> -
>
> Key: SPARK-36180
> URL: https://issues.apache.org/jira/browse/SPARK-36180
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kent Yao
>Priority: Major
>
>  
> {code:java}
> [info] Caused by: java.lang.IllegalArgumentException: Error: type expected at 
> the position 0 of 'timestamp_ntz:timestamp' but 'timestamp_ntz' is 
> found.[info] Caused by: java.lang.IllegalArgumentException: Error: type 
> expected at the position 0 of 'timestamp_ntz:timestamp' but 'timestamp_ntz' 
> is found.[info]  at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:372)[info]
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:355)[info]
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:416)[info]
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:329)[info]
>   at 
> org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:814)[info]
>   at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:162)[info]
>   at 
> org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.(LazySerDeParameters.java:91)[info]
>   at 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:116)[info]
>   at 
> org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:54)[info]
>   at 
> org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:533)[info]
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:453)[info]
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:440)[info]
>   at 
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:281)[info]
>   at 
> org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:199)[info]  
> at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:842)[info]  
> ... 63 more[info]   at 
> org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:385)[info]
>    at 
> org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254)[info]   at 
> org.apache.spark.sql.hive.thriftserver.SparkMetadataOperationSuite.$anonfun$new$145(SparkMetadataOperationSuite.scala:666)[info]
>    at 
> org.apache.spark.sql.hive.thriftserver.SparkMetadataOperationSuite.$anonfun$new$145$adapted(SparkMetadataOperationSuite.scala:665)[info]
>    at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2TestBase.$anonfun$withMultipleConnectionJdbcStatement$4(HiveThriftServer2Suites.scala:1422)[info]
>    at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2TestBase.$anonfun$withMultipleConnectionJdbcStatement$4$adapted(HiveThriftServer2Suites.scala:1422)[info]
>    at 
> scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)[info]
>    at 
> scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)[info]
>    at 
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)[info]   at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2TestBase.$anonfun$withMultipleConnectionJdbcStatement$1(HiveThriftServer2Suites.scala:1422)[info]
>    at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2TestBase.tryCaptureSysLog(HiveThriftServer2Suites.scala:1407)[info]
>    at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2TestBase.withMultipleConnectionJdbcStatement(HiveThriftServer2Suites.scala:1416)[info]
>    at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2TestBase.withJdbcStatement(HiveThriftServer2Suites.scala:1454)[info]
>    at 
> org.apache.spark.sql.hive.thriftserver.SparkMetadataOperationSuite.$anonfun$new$144(SparkMetadataOperationSuite.scala:665)[info]
>    at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)[info]  
>  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)[info]   at 
> org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)[info]   at 
> org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)[info]   at 
> org.scalatest.Transformer.apply(Transformer.scala:22)[info]   at 
> org.scalatest.Transformer.apply(Transformer.scala:20)[info]   at 
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)[info]
>    at 
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190[info]   at 
> 

[jira] [Resolved] (SPARK-36045) TO_UTC_TIMESTAMP and FROM_UTC_TIMESTAMP should return TimestampNTZ

2023-03-20 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-36045.

Resolution: Won't Do

> TO_UTC_TIMESTAMP and FROM_UTC_TIMESTAMP should return TimestampNTZ
> --
>
> Key: SPARK-36045
> URL: https://issues.apache.org/jira/browse/SPARK-36045
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Currently, the SQL function to_utc_timestamp is confusing: it takes the 
> timestamp value in the local timezone, pretends it is in the provided 
> timezone, and returns the UTC value, but the result is still treated as 
> local timezone!
> The same issue happens in from_utc_timestamp as well.
> We even tried to deprecate them in the OSS community: 
> https://github.com/apache/spark/commit/c5e83ab92c0cb514963209dc3e70ba0e24570082
> We should make TO_UTC_TIMESTAMP and FROM_UTC_TIMESTAMP return TimestampNTZ, 
> which makes a lot of sense: converting the current local time to/from UTC 
> local time.
> The functions should accept both timestamp types:
> 1. given TimestampLTZ, convert it to TimestampNTZ and continue with step #2
> 2. given TimestampNTZ, convert it to/from UTC local time.
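A small demo of the confusion described above (illustrative; the exact output depends on your session time zone):

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

df = spark.sql("SELECT timestamp'2021-07-01 00:00:00' AS ts")
# to_utc_timestamp pretends ts is in the given zone and shifts it to UTC,
# but the result column is still a session-local TIMESTAMP, so it renders
# in the session time zone again.
df.select(F.to_utc_timestamp("ts", "America/Los_Angeles")).show()
{code}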






[jira] [Commented] (SPARK-35662) Support Timestamp without time zone data type

2023-03-20 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702920#comment-17702920
 ] 

Gengliang Wang commented on SPARK-35662:


[~beliefer] [~ivan.sadikov] [~gurwls223] [~sarutak] [~cloud_fan] Thanks for the 
work! Marking this one as resolved :)

[~wrschneider99] Yes, it will be available in Spark 3.4.0.

> Support Timestamp without time zone data type
> -
>
> Key: SPARK-35662
> URL: https://issues.apache.org/jira/browse/SPARK-35662
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.4.0
>
>
> Spark SQL today supports the TIMESTAMP data type. However the semantics 
> provided actually match TIMESTAMP WITH LOCAL TIMEZONE as defined by Oracle. 
> Timestamps embedded in a SQL query or passed through JDBC are presumed to be 
> in session local timezone and cast to UTC before being processed.
>  These are desirable semantics in many cases, such as when dealing with 
> calendars.
>  In many (more) other cases, such as when dealing with log files it is 
> desirable that the provided timestamps not be altered.
>  SQL users expect that they can model either behavior and do so by using 
> TIMESTAMP WITHOUT TIME ZONE for time zone insensitive data and TIMESTAMP WITH 
> LOCAL TIME ZONE for time zone sensitive data.
> Most traditional RDBMSs map TIMESTAMP to TIMESTAMP WITHOUT TIME ZONE, so 
> users will be surprised to see TIMESTAMP WITH LOCAL TIME ZONE, a feature 
> that does not exist in the standard.
> In this new feature, we will introduce TIMESTAMP WITH LOCAL TIMEZONE to 
> describe the existing timestamp type and add TIMESTAMP WITHOUT TIME ZONE for 
> the standard semantics.
>  Using these two types will provide clarity.
>  We will also allow users to set the default behavior for TIMESTAMP to either 
> use TIMESTAMP WITH LOCAL TIME ZONE or TIMESTAMP WITHOUT TIME ZONE.
> h3. Milestone 1 – Spark Timestamp equivalency ( The new Timestamp type 
> TimestampWithoutTZ meets or exceeds all function of the existing SQL 
> Timestamp):
>  * Add a new DataType implementation for TimestampWithoutTZ.
>  * Support TimestampWithoutTZ in Dataset/UDF.
>  * TimestampWithoutTZ literals
>  * TimestampWithoutTZ arithmetic (e.g. TimestampWithoutTZ - 
> TimestampWithoutTZ, TimestampWithoutTZ - Date)
>  * Datetime functions/operators: dayofweek, weekofyear, year, etc.
>  * Cast to and from TimestampWithoutTZ, cast String/Timestamp to 
> TimestampWithoutTZ, cast TimestampWithoutTZ to string (pretty 
> printing)/Timestamp, with the SQL syntax to specify the types
>  * Support sorting TimestampWithoutTZ.
> h3. Milestone 2 – Persistence:
>  * Ability to create tables of type TimestampWithoutTZ
>  * Ability to write to common file formats such as Parquet and JSON.
>  * INSERT, SELECT, UPDATE, MERGE
>  * Discovery
> h3. Milestone 3 – Client support
>  * JDBC support
>  * Hive Thrift server
> h3. Milestone 4 – PySpark and Spark R integration
>  * Python UDF can take and return TimestampWithoutTZ
>  * DataFrame support
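For reference, a hedged sketch of the finished type in use (PySpark 3.4 APIs; the proposal's TimestampWithoutTZ shipped as TIMESTAMP_NTZ / TimestampNTZType):

{code:python}
import datetime

from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, TimestampNTZType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField("ts", TimestampNTZType(), True)])
df = spark.createDataFrame([(datetime.datetime(2021, 7, 1, 0, 0),)], schema)
df.printSchema()  # ts: timestamp_ntz -- no session time zone is applied
{code}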






[jira] [Assigned] (SPARK-42870) Move `toCatalystValue` to connect-common

2023-03-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42870:


Assignee: Ruifeng Zheng

> Move `toCatalystValue` to connect-common
> 
>
> Key: SPARK-42870
> URL: https://issues.apache.org/jira/browse/SPARK-42870
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>







[jira] [Resolved] (SPARK-42870) Move `toCatalystValue` to connect-common

2023-03-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42870.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40485
[https://github.com/apache/spark/pull/40485]

> Move `toCatalystValue` to connect-common
> 
>
> Key: SPARK-42870
> URL: https://issues.apache.org/jira/browse/SPARK-42870
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Created] (SPARK-42875) Fix toPandas to handle timezone and map types properly.

2023-03-20 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-42875:
-

 Summary: Fix toPandas to handle timezone and map types properly.
 Key: SPARK-42875
 URL: https://issues.apache.org/jira/browse/SPARK-42875
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Takuya Ueshin









[jira] [Updated] (SPARK-42702) Support parameterized CTE

2023-03-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42702:
--
Fix Version/s: 3.4.1
   (was: 3.4.0)

> Support parameterized CTE
> -
>
> Key: SPARK-42702
> URL: https://issues.apache.org/jira/browse/SPARK-42702
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.4.1
>
>
> Support named parameters in named common table expressions (CTEs). At the 
> moment, such queries fail:
> {code:java}
> CREATE TABLE tbl(namespace STRING) USING parquet
> INSERT INTO tbl SELECT 'abc'
> WITH transitions AS (
>   SELECT * FROM tbl WHERE namespace = :namespace
> ) SELECT * FROM transitions {code}
> with the following error:
> {code:java}
> [UNBOUND_SQL_PARAMETER] Found the unbound parameter: `namespace`. Please, fix 
> `args` and provide a mapping of the parameter to a SQL literal.; line 3 pos 
> 38;
> 'WithCTE
> :- 'CTERelationDef 0, false
> :  +- 'SubqueryAlias transitions
> :     +- 'Project [*]
> :        +- 'Filter (namespace#3 = parameter(namespace))
> :           +- SubqueryAlias spark_catalog.default.tbl
> :              +- Relation spark_catalog.default.tbl[namespace#3] parquet
> +- 'Project [*]
>    +- 'SubqueryAlias transitions
>       +- 'CTERelationRef 0, falseorg.apache.spark.sql.AnalysisException: 
> [UNBOUND_SQL_PARAMETER] Found the unbound parameter: `namespace`. Please, fix 
> `args` and provide a mapping of the parameter to a SQL literal.; line 3 pos 
> 38;
> 'WithCTE
> :- 'CTERelationDef 0, false
> :  +- 'SubqueryAlias transitions
> :     +- 'Project [*]
> :        +- 'Filter (namespace#3 = parameter(namespace))
> :           +- SubqueryAlias spark_catalog.default.tbl
> :              +- Relation spark_catalog.default.tbl[namespace#3] parquet
> +- 'Project [*]
>    +- 'SubqueryAlias transitions
>       +- 'CTERelationRef 0, false    at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:339)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:244)
>  {code}
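For context, a hedged example of the named-parameter API this feature extends: PySpark's `spark.sql(..., args=...)`, run against an active `spark` session with the `tbl` table from the repro above. Note that in Spark 3.4 the dict values are parsed as SQL literal text, while later versions also accept plain Python objects.

{code:python}
# Once parameterized CTEs are supported, the failing query above would be
# bound like this.
spark.sql(
    """
    WITH transitions AS (
      SELECT * FROM tbl WHERE namespace = :namespace
    ) SELECT * FROM transitions
    """,
    args={"namespace": "'abc'"},  # SQL literal text for the parameter value
).show()
{code}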






[jira] [Commented] (SPARK-42702) Support parameterized CTE

2023-03-20 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702898#comment-17702898
 ] 

Dongjoon Hyun commented on SPARK-42702:
---

I changed the Fix Version to 3.4.1 because no Apache Spark 3.4.0 RC 
contains this patch yet.

> Support parameterized CTE
> -
>
> Key: SPARK-42702
> URL: https://issues.apache.org/jira/browse/SPARK-42702
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.4.1
>
>
> Support named parameters in named common table expressions (CTEs). At the 
> moment, such queries fail:
> {code:java}
> CREATE TABLE tbl(namespace STRING) USING parquet
> INSERT INTO tbl SELECT 'abc'
> WITH transitions AS (
>   SELECT * FROM tbl WHERE namespace = :namespace
> ) SELECT * FROM transitions {code}
> with the following error:
> {code:java}
> [UNBOUND_SQL_PARAMETER] Found the unbound parameter: `namespace`. Please, fix 
> `args` and provide a mapping of the parameter to a SQL literal.; line 3 pos 
> 38;
> 'WithCTE
> :- 'CTERelationDef 0, false
> :  +- 'SubqueryAlias transitions
> :     +- 'Project [*]
> :        +- 'Filter (namespace#3 = parameter(namespace))
> :           +- SubqueryAlias spark_catalog.default.tbl
> :              +- Relation spark_catalog.default.tbl[namespace#3] parquet
> +- 'Project [*]
>    +- 'SubqueryAlias transitions
>       +- 'CTERelationRef 0, falseorg.apache.spark.sql.AnalysisException: 
> [UNBOUND_SQL_PARAMETER] Found the unbound parameter: `namespace`. Please, fix 
> `args` and provide a mapping of the parameter to a SQL literal.; line 3 pos 
> 38;
> 'WithCTE
> :- 'CTERelationDef 0, false
> :  +- 'SubqueryAlias transitions
> :     +- 'Project [*]
> :        +- 'Filter (namespace#3 = parameter(namespace))
> :           +- SubqueryAlias spark_catalog.default.tbl
> :              +- Relation spark_catalog.default.tbl[namespace#3] parquet
> +- 'Project [*]
>    +- 'SubqueryAlias transitions
>       +- 'CTERelationRef 0, false    at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:52)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5(CheckAnalysis.scala:339)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$5$adapted(CheckAnalysis.scala:244)
>  {code}






[jira] [Updated] (SPARK-42818) Implement DataFrameReader/Writer.jdbc

2023-03-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42818:
--
Fix Version/s: 3.4.1
   (was: 3.4.0)

> Implement DataFrameReader/Writer.jdbc
> -
>
> Key: SPARK-42818
> URL: https://issues.apache.org/jira/browse/SPARK-42818
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.4.1
>
>







[jira] [Updated] (SPARK-42767) Add check condition to start connect server fallback with `in-memory` and auto-ignore some tests that strongly depend on Hive

2023-03-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42767:
--
Fix Version/s: 3.4.1
   (was: 3.4.0)

> Add check condition to start connect server fallback with `in-memory` and 
> auto-ignore some tests that strongly depend on Hive
> -
>
> Key: SPARK-42767
> URL: https://issues.apache.org/jira/browse/SPARK-42767
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, Tests
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.4.1
>
>







[jira] [Resolved] (SPARK-42812) client_type is missing from AddArtifactsRequest proto message

2023-03-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-42812.
---
Fix Version/s: 3.4.1
 Assignee: Venkata Sai Akhil Gudesa
   Resolution: Fixed

> client_type is missing from AddArtifactsRequest proto message
> -
>
> Key: SPARK-42812
> URL: https://issues.apache.org/jira/browse/SPARK-42812
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Venkata Sai Akhil Gudesa
>Priority: Major
> Fix For: 3.4.1
>
>
> The client_type is missing from AddArtifactsRequest proto message






[jira] [Updated] (SPARK-42817) Spark driver logs are filled with Initializing service data for shuffle service using name

2023-03-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42817:
--
Fix Version/s: 3.4.1
   (was: 3.4.0)

> Spark driver logs are filled with Initializing service data for shuffle 
> service using name
> --
>
> Key: SPARK-42817
> URL: https://issues.apache.org/jira/browse/SPARK-42817
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Fix For: 3.4.1
>
>
> With SPARK-34828, we added the ability to make the shuffle service name 
> configurable, and we added a log line 
> [here|https://github.com/apache/spark/blob/8860f69455e5a722626194c4797b4b42cccd4510/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnable.scala#L118]
>  that records the shuffle service name. However, this line is printed in the 
> driver logs whenever a new executor is launched, which pollutes the log. 
> {code}
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> 22/08/03 20:42:07 INFO ExecutorRunnable: Initializing service data for 
> shuffle service using name 'spark_shuffle_311'
> {code}
> We can just log this once in the driver.
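The "log it once" idea, sketched in Python for illustration (the actual change is in the Scala ExecutorRunnable; this only shows the guard pattern):

{code:python}
import logging

logger = logging.getLogger(__name__)
_shuffle_service_logged = False

def init_service_data(shuffle_service_name: str) -> None:
    global _shuffle_service_logged
    if not _shuffle_service_logged:
        # Emit the informational line only for the first executor launch.
        logger.info("Initializing service data for shuffle service using "
                    "name '%s'", shuffle_service_name)
        _shuffle_service_logged = True
    # ... build and register the service data for the new executor ...
{code}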






[jira] [Updated] (SPARK-42826) Add migration notes for update to supported pandas version.

2023-03-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42826:
--
Fix Version/s: 3.4.1
   (was: 3.4.0)

> Add migration notes for update to supported pandas version.
> ---
>
> Key: SPARK-42826
> URL: https://issues.apache.org/jira/browse/SPARK-42826
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.1
>
>
> We deprecated and removed some APIs in 
> https://issues.apache.org/jira/browse/SPARK-42593 to follow pandas.
> We should mention this in the migration guide.






[jira] [Updated] (SPARK-42020) createDataFrame with UDT

2023-03-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42020:
--
Fix Version/s: 3.4.1
   (was: 3.4.0)

> createDataFrame with UDT
> 
>
> Key: SPARK-42020
> URL: https://issues.apache.org/jira/browse/SPARK-42020
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.4.1
>
>
> {code}
> pyspark/sql/tests/test_types.py:596 
> (TypesParityTests.test_apply_schema_with_udt)
> self = <TypesParityTests testMethod=test_apply_schema_with_udt>
> def test_apply_schema_with_udt(self):
> row = (1.0, ExamplePoint(1.0, 2.0))
> schema = StructType(
> [
> StructField("label", DoubleType(), False),
> StructField("point", ExamplePointUDT(), False),
> ]
> )
> >   df = self.spark.createDataFrame([row], schema)
> ../test_types.py:605: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> ../../connect/session.py:282: in createDataFrame
> _table = pa.Table.from_pylist([dict(zip(_cols, list(item))) for item in 
> _data])
> pyarrow/table.pxi:3700: in pyarrow.lib.Table.from_pylist
> ???
> pyarrow/table.pxi:5221: in pyarrow.lib._from_pylist
> ???
> pyarrow/table.pxi:3575: in pyarrow.lib.Table.from_arrays
> ???
> pyarrow/table.pxi:1383: in pyarrow.lib._sanitize_arrays
> ???
> pyarrow/table.pxi:1364: in pyarrow.lib._schema_from_arrays
> ???
> pyarrow/array.pxi:320: in pyarrow.lib.array
> ???
> pyarrow/array.pxi:39: in pyarrow.lib._sequence_to_array
> ???
> pyarrow/error.pxi:144: in pyarrow.lib.pyarrow_internal_check_status
> ???
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> >   ???
> E   pyarrow.lib.ArrowInvalid: Could not convert ExamplePoint(1.0,2.0) with 
> type ExamplePoint: did not recognize Python value type when inferring an 
> Arrow data type
> pyarrow/error.pxi:100: ArrowInvalid
> {code}
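One hedged way around the failure above (a hypothetical helper, not the fix that shipped): serialize UDT values to their underlying sqlType representation before the rows reach Arrow, since Arrow only understands the built-in types.

{code:python}
from pyspark.sql.types import StructType, UserDefinedType

def to_arrow_safe(row, schema: StructType):
    """Replace UDT values with their serialized sqlType form."""
    out = []
    for value, field in zip(row, schema.fields):
        if isinstance(field.dataType, UserDefinedType):
            # e.g. ExamplePointUDT().serialize(ExamplePoint(1.0, 2.0))
            out.append(field.dataType.serialize(value))
        else:
            out.append(value)
    return tuple(out)
{code}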






[jira] [Updated] (SPARK-41843) Implement SparkSession.udf

2023-03-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-41843:
--
Fix Version/s: 3.4.1
   (was: 3.4.0)

> Implement SparkSession.udf
> --
>
> Key: SPARK-41843
> URL: https://issues.apache.org/jira/browse/SPARK-41843
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Priority: Major
> Fix For: 3.4.1
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/functions.py", 
> line 2331, in pyspark.sql.connect.functions.call_udf
> Failed example:
>     _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType())
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.functions.call_udf[...]>", line 1, in <module>
>         _ = spark.udf.register("intX2", lambda i: i * 2, IntegerType())
>     AttributeError: 'SparkSession' object has no attribute 'udf'{code}
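For reference, what the missing attribute is expected to do; this snippet runs against classic PySpark (assuming an active `spark` session) and mirrors the doctest above:

{code:python}
from pyspark.sql.types import IntegerType

# Register a SQL-callable UDF on the session, then invoke it from SQL.
_ = spark.udf.register("intX2", lambda i: i * 2, IntegerType())
spark.sql("SELECT intX2(21) AS doubled").show()
# +-------+
# |doubled|
# +-------+
# |     42|
# +-------+
{code}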






[jira] [Updated] (SPARK-42848) Implement DataFrame.registerTempTable

2023-03-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42848:
--
Fix Version/s: 3.4.1
   (was: 3.4.0)

> Implement DataFrame.registerTempTable
> -
>
> Key: SPARK-42848
> URL: https://issues.apache.org/jira/browse/SPARK-42848
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.4.1
>
>







[jira] [Updated] (SPARK-41818) Support DataFrameWriter.saveAsTable

2023-03-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-41818:
--
Fix Version/s: 3.4.1
   (was: 3.4.0)

> Support DataFrameWriter.saveAsTable
> ---
>
> Key: SPARK-41818
> URL: https://issues.apache.org/jira/browse/SPARK-41818
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Sandeep Singh
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.4.1
>
>
> {code:java}
> File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/readwriter.py", 
> line 369, in pyspark.sql.connect.readwriter.DataFrameWriter.insertInto
> Failed example:
>     df.write.saveAsTable("tblA")
> Exception raised:
>     Traceback (most recent call last):
>       File 
> "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py",
>  line 1350, in __run
>         exec(compile(example.source, filename, "single",
>       File "<doctest pyspark.sql.connect.readwriter.DataFrameWriter.insertInto[2]>", line 1, in <module>
>         df.write.saveAsTable("tblA")
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/readwriter.py", 
> line 350, in saveAsTable
>         
> self._spark.client.execute_command(self._write.command(self._spark.client))
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 459, in execute_command
>         self._execute(req)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 547, in _execute
>         self._handle_error(rpc_error)
>       File 
> "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", 
> line 623, in _handle_error
>         raise SparkConnectException(status.message, info.reason) from None
>     pyspark.sql.connect.client.SparkConnectException: 
> (java.lang.ClassNotFoundException) .DefaultSource{code}
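The `.DefaultSource` in the error above suggests an empty data-source name was resolved. As a hedged workaround sketch (not the shipped fix, and assuming the `df` from the doctest), pinning the format avoids the default-source lookup entirely:

{code:python}
# Hypothetical workaround: name the source explicitly when saving.
df.write.format("parquet").saveAsTable("tblA")
{code}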






[jira] [Updated] (SPARK-42824) Provide a clear error message for unsupported JVM attributes.

2023-03-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42824:
--
Fix Version/s: 3.4.1
   (was: 3.4.0)

> Provide a clear error message for unsupported JVM attributes.
> -
>
> Key: SPARK-42824
> URL: https://issues.apache.org/jira/browse/SPARK-42824
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.4.1
>
>
> There are attributes, such as "_jvm", that were accessible in PySpark but 
> cannot be accessed in Spark Connect. We need to display appropriate error 
> messages for these cases.
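A minimal sketch of the proposed behavior (hypothetical class name and attribute set, not the merged code): intercept JVM-backed attributes and raise a descriptive error instead of a bare AttributeError.

{code:python}
class ConnectSessionSketch:
    # Attributes that exist in classic PySpark but rely on a driver-side JVM.
    _JVM_ONLY = {"_jvm", "_jsc"}

    def __getattr__(self, name):
        if name in self._JVM_ONLY:
            raise AttributeError(
                f"'{name}' is not supported in Spark Connect: it exposes the "
                "driver-side JVM, which does not exist in the Connect client."
            )
        raise AttributeError(name)
{code}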






[jira] [Updated] (SPARK-42778) QueryStageExec should respect supportsRowBased

2023-03-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42778:
--
Fix Version/s: 3.4.1
   (was: 3.4.0)

> QueryStageExec should respect supportsRowBased
> --
>
> Key: SPARK-42778
> URL: https://issues.apache.org/jira/browse/SPARK-42778
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
> Fix For: 3.4.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42247) Standardize `returnType` property of UserDefinedFunction

2023-03-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42247:
--
Fix Version/s: 3.4.1
   (was: 3.4.0)

> Standardize `returnType` property of UserDefinedFunction
> 
>
> Key: SPARK-42247
> URL: https://issues.apache.org/jira/browse/SPARK-42247
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.4.1
>
>
> There are checks 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42852) Revert NamedLambdaVariable related changes from EquivalentExpressions

2023-03-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42852:
--
Fix Version/s: 3.4.1
   (was: 3.4.0)

> Revert NamedLambdaVariable related changes from EquivalentExpressions
> -
>
> Key: SPARK-42852
> URL: https://issues.apache.org/jira/browse/SPARK-42852
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.4.1
>
>
> See discussion 
> https://github.com/apache/spark/pull/40473#issuecomment-1474848224



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41585) The Spark exclude node functionality for YARN should work independently of dynamic allocation

2023-03-20 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-41585.
---
   Fix Version/s: 3.5.0
Target Version/s: 3.5.0
Assignee: Luca Canali
  Resolution: Fixed

> The Spark exclude node functionality for YARN should work independently of 
> dynamic allocation
> -
>
> Key: SPARK-41585
> URL: https://issues.apache.org/jira/browse/SPARK-41585
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.3, 3.1.3, 3.2.2, 3.3.1
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 3.5.0
>
>
> The Spark exclude node functionality for Spark on YARN, introduced in 
> SPARK-26688, allows users to specify a list of node names that are excluded 
> from resource allocation. This is done using the configuration parameter: 
> {{spark.yarn.exclude.nodes}}
> The feature currently works only for executors allocated via dynamic 
> allocation. To use the feature on Spark 3.3.1, for example, one may set the 
> configurations {{spark.dynamicAllocation.enabled}}=true, 
> spark.dynamicAllocation.minExecutors=0 and spark.executor.instances=0, thus 
> making Spark spawn executors only via dynamic allocation (see the sketch below).
> This issue proposes to document this behavior for the current Spark release 
> and to improve the feature by extending the scope of the Spark exclude node 
> functionality for YARN beyond dynamic allocation, which I believe makes it 
> more generally useful.
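
A configuration sketch of the workaround described above (the config keys are 
taken from the issue text; the host names are placeholders, and dynamic 
allocation may need additional settings such as shuffle tracking, omitted here):

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")
    .appName("exclude-nodes-example")
    # Exclusion currently applies only to dynamically allocated executors,
    # so route all executors through dynamic allocation:
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "0")
    .config("spark.executor.instances", "0")
    .config("spark.yarn.exclude.nodes", "badnode1.example.com,badnode2.example.com")
    .getOrCreate()
)
{code}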



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41971) `toPandas` should support duplicate field names when arrow-optimization is on

2023-03-20 Thread Niket Jain (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702892#comment-17702892
 ] 

Niket Jain commented on SPARK-41971:


Can I work on this issue?

> `toPandas` should support duplicate field names when arrow-optimization is on
> -
>
> Key: SPARK-41971
> URL: https://issues.apache.org/jira/browse/SPARK-41971
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>
> toPandas supports duplicate column names, but for a struct column, it does 
> not support duplicate field names.
> {code:java}
> In [27]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False)
> In [28]: spark.sql("select 1 v, 1 v").toPandas()
> Out[28]: 
>v  v
> 0  1  1
> In [29]: spark.sql("select struct(1 v, 1 v)").toPandas()
> Out[29]: 
>   struct(1 AS v, 1 AS v)
> 0 (1, 1)
> In [30]: spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
> In [31]: spark.sql("select 1 v, 1 v").toPandas()
> Out[31]: 
>v  v
> 0  1  1
> In [32]: spark.sql("select struct(1 v, 1 v)").toPandas()
> /Users/ruifeng.zheng/Dev/spark/python/pyspark/sql/pandas/conversion.py:204: 
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true, but has reached 
> the error below and can not continue. Note that 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' does not have an effect 
> on failures in the middle of computation.
>   Ran out of field metadata, likely malformed
>   warn(msg)
> ---
> ArrowInvalid  Traceback (most recent call last)
> Cell In[32], line 1
> > 1 spark.sql("select struct(1 v, 1 v)").toPandas()
> File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:143, in 
> PandasConversionMixin.toPandas(self)
> 141 tmp_column_names = ["col_{}".format(i) for i in 
> range(len(self.columns))]
> 142 self_destruct = jconf.arrowPySparkSelfDestructEnabled()
> --> 143 batches = self.toDF(*tmp_column_names)._collect_as_arrow(
> 144 split_batches=self_destruct
> 145 )
> 146 if len(batches) > 0:
> 147 table = pyarrow.Table.from_batches(batches)
> File ~/Dev/spark/python/pyspark/sql/pandas/conversion.py:358, in 
> PandasConversionMixin._collect_as_arrow(self, split_batches)
> 356 results.append(batch_or_indices)
> 357 else:
> --> 358 results = list(batch_stream)
> 359 finally:
> 360 # Join serving thread and raise any exceptions from 
> collectAsArrowToPython
> 361 jsocket_auth_server.getResult()
> File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:55, in 
> ArrowCollectSerializer.load_stream(self, stream)
>  50 """
>  51 Load a stream of un-ordered Arrow RecordBatches, where the last 
> iteration yields
>  52 a list of indices that can be used to put the RecordBatches in the 
> correct order.
>  53 """
>  54 # load the batches
> ---> 55 for batch in self.serializer.load_stream(stream):
>  56 yield batch
>  58 # load the batch order indices or propagate any error that occurred 
> in the JVM
> File ~/Dev/spark/python/pyspark/sql/pandas/serializers.py:98, in 
> ArrowStreamSerializer.load_stream(self, stream)
>  95 import pyarrow as pa
>  97 reader = pa.ipc.open_stream(stream)
> ---> 98 for batch in reader:
>  99 yield batch
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:638,
>  in __iter__()
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/ipc.pxi:674,
>  in pyarrow.lib.RecordBatchReader.read_next_batch()
> File 
> ~/.dev/miniconda3/envs/spark_dev/lib/python3.9/site-packages/pyarrow/error.pxi:100,
>  in pyarrow.lib.check_status()
> ArrowInvalid: Ran out of field metadata, likely malformed
> {code}
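
A workaround sketch based on the repro above: fall back to the non-Arrow path 
(shown working in In[29]), or rename the struct fields to be unique (the 
renamed query is an illustrative assumption):

{code:python}
# Option 1: disable Arrow optimization for this query.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False)
pdf = spark.sql("select struct(1 v, 1 v)").toPandas()

# Option 2: give the struct fields distinct names so the Arrow field
# metadata stays well-formed.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
pdf = spark.sql("select struct(1 v1, 1 v2)").toPandas()
{code}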



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42874) Enable new golden file test framework for analysis for all input files

2023-03-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702888#comment-17702888
 ] 

Apache Spark commented on SPARK-42874:
--

User 'dtenedor' has created a pull request for this issue:
https://github.com/apache/spark/pull/40496

> Enable new golden file test framework for analysis for all input files
> --
>
> Key: SPARK-42874
> URL: https://issues.apache.org/jira/browse/SPARK-42874
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42874) Enable new golden file test framework for analysis for all input files

2023-03-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42874:


Assignee: Apache Spark

> Enable new golden file test framework for analysis for all input files
> --
>
> Key: SPARK-42874
> URL: https://issues.apache.org/jira/browse/SPARK-42874
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42874) Enable new golden file test framework for analysis for all input files

2023-03-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42874:


Assignee: (was: Apache Spark)

> Enable new golden file test framework for analysis for all input files
> --
>
> Key: SPARK-42874
> URL: https://issues.apache.org/jira/browse/SPARK-42874
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42874) Enable new golden file test framework for analysis for all input files

2023-03-20 Thread Daniel (Jira)
Daniel created SPARK-42874:
--

 Summary: Enable new golden file test framework for analysis for 
all input files
 Key: SPARK-42874
 URL: https://issues.apache.org/jira/browse/SPARK-42874
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Daniel






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35662) Support Timestamp without time zone data type

2023-03-20 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-35662.

Fix Version/s: 3.4.0
   Resolution: Fixed

> Support Timestamp without time zone data type
> -
>
> Key: SPARK-35662
> URL: https://issues.apache.org/jira/browse/SPARK-35662
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.4.0
>
>
> Spark SQL today supports the TIMESTAMP data type. However the semantics 
> provided actually match TIMESTAMP WITH LOCAL TIMEZONE as defined by Oracle. 
> Timestamps embedded in a SQL query or passed through JDBC are presumed to be 
> in session local timezone and cast to UTC before being processed.
>  These are desirable semantics in many cases, such as when dealing with 
> calendars.
>  In many other cases, such as when dealing with log files, it is 
> desirable that the provided timestamps not be altered.
>  SQL users expect that they can model either behavior and do so by using 
> TIMESTAMP WITHOUT TIME ZONE for time zone insensitive data and TIMESTAMP WITH 
> LOCAL TIME ZONE for time zone sensitive data.
>  Most traditional RDBMS map TIMESTAMP to TIMESTAMP WITHOUT TIME ZONE and will 
> be surprised to see TIMESTAMP WITH LOCAL TIME ZONE, a feature that does not 
> exist in the standard.
> In this new feature, we will introduce TIMESTAMP WITH LOCAL TIME ZONE to 
> describe the existing timestamp type and add TIMESTAMP WITHOUT TIME ZONE for 
> the standard semantics (see the sketch after the milestones below).
>  Using these two types will provide clarity.
>  We will also allow users to set the default behavior for TIMESTAMP to either 
> use TIMESTAMP WITH LOCAL TIME ZONE or TIMESTAMP WITHOUT TIME ZONE.
> h3. Milestone 1 – Spark Timestamp equivalency (the new Timestamp type 
> TimestampWithoutTZ meets or exceeds all functionality of the existing SQL 
> Timestamp):
>  * Add a new DataType implementation for TimestampWithoutTZ.
>  * Support TimestampWithoutTZ in Dataset/UDF.
>  * TimestampWithoutTZ literals
>  * TimestampWithoutTZ arithmetic(e.g. TimestampWithoutTZ - 
> TimestampWithoutTZ, TimestampWithoutTZ - Date)
>  * Datetime functions/operators: dayofweek, weekofyear, year, etc
>  * Cast to and from TimestampWithoutTZ, cast String/Timestamp to 
> TimestampWithoutTZ, cast TimestampWithoutTZ to string (pretty 
> printing)/Timestamp, with the SQL syntax to specify the types
>  * Support sorting TimestampWithoutTZ.
> h3. Milestone 2 – Persistence:
>  * Ability to create tables of type TimestampWithoutTZ
>  * Ability to write to common file formats such as Parquet and JSON.
>  * INSERT, SELECT, UPDATE, MERGE
>  * Discovery
> h3. Milestone 3 – Client support
>  * JDBC support
>  * Hive Thrift server
> h3. Milestone 4 – PySpark and Spark R integration
>  * Python UDF can take and return TimestampWithoutTZ
>  * DataFrame support
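
A hedged sketch of the two semantics side by side (assumes the TIMESTAMP_NTZ 
and TIMESTAMP_LTZ literal syntax this feature describes; values are arbitrary):

{code:python}
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

# NTZ: a wall-clock value, not adjusted for the session time zone.
spark.sql("SELECT TIMESTAMP_NTZ '2021-06-01 00:00:00' AS ntz").show()

# LTZ: interpreted in the session time zone (the existing TIMESTAMP behavior).
spark.sql("SELECT TIMESTAMP_LTZ '2021-06-01 00:00:00' AS ltz").show()
{code}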



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42839) Assign a name to the error class _LEGACY_ERROR_TEMP_2003

2023-03-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42839:


Assignee: (was: Apache Spark)

> Assign a name to the error class _LEGACY_ERROR_TEMP_2003
> 
>
> Key: SPARK-42839
> URL: https://issues.apache.org/jira/browse/SPARK-42839
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
> Attachments: Screenshot from 2023-03-21 00-20-11.png
>
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2003* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't 
> exist yet. Check exception fields by using {*}checkError(){*}. That function 
> checks only the valuable error fields and avoids depending on the error text 
> message. In this way, tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate 
> other tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace 
> the error with an internal error; see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is 
> not clear. Suggest to users how to avoid and fix such errors.
> Please, look at the PR below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42839) Assign a name to the error class _LEGACY_ERROR_TEMP_2003

2023-03-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42839:


Assignee: Apache Spark

> Assign a name to the error class _LEGACY_ERROR_TEMP_2003
> 
>
> Key: SPARK-42839
> URL: https://issues.apache.org/jira/browse/SPARK-42839
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
> Attachments: Screenshot from 2023-03-21 00-20-11.png
>
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2003* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't 
> exist yet. Check exception fields by using {*}checkError(){*}. That function 
> checks only the valuable error fields and avoids depending on the error text 
> message. In this way, tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate 
> other tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace 
> the error with an internal error; see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is 
> not clear. Suggest to users how to avoid and fix such errors.
> Please, look at the PR below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42839) Assign a name to the error class _LEGACY_ERROR_TEMP_2003

2023-03-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702841#comment-17702841
 ] 

Apache Spark commented on SPARK-42839:
--

User 'ruilibuaa' has created a pull request for this issue:
https://github.com/apache/spark/pull/40493

> Assign a name to the error class _LEGACY_ERROR_TEMP_2003
> 
>
> Key: SPARK-42839
> URL: https://issues.apache.org/jira/browse/SPARK-42839
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
> Attachments: Screenshot from 2023-03-21 00-20-11.png
>
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2003* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't 
> exist yet. Check exception fields by using {*}checkError(){*}. That function 
> checks only the valuable error fields and avoids depending on the error text 
> message. In this way, tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate 
> other tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace 
> the error with an internal error; see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is 
> not clear. Suggest to users how to avoid and fix such errors.
> Please, look at the PR below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42773) Minor grammatical change to "Supports Spark Connect" message

2023-03-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-42773:
-
Priority: Trivial  (was: Major)

> Minor grammatical change to "Supports Spark Connect" message
> 
>
> Key: SPARK-42773
> URL: https://issues.apache.org/jira/browse/SPARK-42773
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Allan Folting
>Assignee: Allan Folting
>Priority: Trivial
> Fix For: 3.4.1
>
>
> Changing "Support Spark Connect" to "Supports Spark Connect" in the 3.4.0 
> version change message which is also used in the documentation:
>  
> .. versionchanged:: 3.4.0
>      Supports Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42839) Assign a name to the error class _LEGACY_ERROR_TEMP_2003

2023-03-20 Thread LI RUI (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LI RUI updated SPARK-42839:
---
Attachment: Screenshot from 2023-03-21 00-20-11.png

> Assign a name to the error class _LEGACY_ERROR_TEMP_2003
> 
>
> Key: SPARK-42839
> URL: https://issues.apache.org/jira/browse/SPARK-42839
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
> Attachments: Screenshot from 2023-03-21 00-20-11.png
>
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2003* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't 
> exist yet. Check exception fields by using {*}checkError(){*}. That function 
> checks only the valuable error fields and avoids depending on the error text 
> message. In this way, tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate 
> other tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace 
> the error with an internal error; see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is 
> not clear. Suggest to users how to avoid and fix such errors.
> Please, look at the PR below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42839) Assign a name to the error class _LEGACY_ERROR_TEMP_2003

2023-03-20 Thread LI RUI (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702838#comment-17702838
 ] 

LI RUI commented on SPARK-42839:


Hey, Max~ I am trying to complete this task. I have submitted a commit on 
GitHub where I made the following changes: 1) I replaced 
"_LEGACY_ERROR_TEMP_2003" with "CANNOT_ZIP_MAPS". 2) I created a new test case 
where I attempted to use checkError() and added a new exception definition in 
AlreadyExistException.scala. However, I found that instead of throwing an 
AnalysisException, it was throwing a SparkException. So, I switched to using 
> assert instead. I'm not sure if this is the correct approach; could you please 
> provide some guidance? The results of the execution are attached.

> Assign a name to the error class _LEGACY_ERROR_TEMP_2003
> 
>
> Key: SPARK-42839
> URL: https://issues.apache.org/jira/browse/SPARK-42839
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
> Attachments: Screenshot from 2023-03-21 00-20-11.png
>
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_2003* defined in 
> {*}core/src/main/resources/error/error-classes.json{*}. The name should be 
> short but complete (look at the example in error-classes.json).
> Add a test which triggers the error from user code if such a test doesn't 
> exist yet. Check exception fields by using {*}checkError(){*}. That function 
> checks only the valuable error fields and avoids depending on the error text 
> message. In this way, tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate 
> other tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), replace 
> the error with an internal error; see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one is 
> not clear. Suggest to users how to avoid and fix such errors.
> Please, look at the PR below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42873) Define Spark SQL types as keywords

2023-03-20 Thread Max Gekk (Jira)
Max Gekk created SPARK-42873:


 Summary: Define Spark SQL types as keywords
 Key: SPARK-42873
 URL: https://issues.apache.org/jira/browse/SPARK-42873
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Max Gekk
Assignee: Max Gekk


Currently, Spark SQL defines primitive types as:

 
{code}
| identifier (LEFT_PAREN INTEGER_VALUE
  (COMMA INTEGER_VALUE)* RIGHT_PAREN)?  #primitiveDataType
{code}
where identifier is parsed later by visitPrimitiveDataType():

{code:scala}
  override def visitPrimitiveDataType(ctx: PrimitiveDataTypeContext): DataType 
= withOrigin(ctx) {
val dataType = ctx.identifier.getText.toLowerCase(Locale.ROOT)
(dataType, ctx.INTEGER_VALUE().asScala.toList) match {
  case ("boolean", Nil) => BooleanType
  case ("tinyint" | "byte", Nil) => ByteType
  case ("smallint" | "short", Nil) => ShortType
  case ("int" | "integer", Nil) => IntegerType
  case ("bigint" | "long", Nil) => LongType
  case ("float" | "real", Nil) => FloatType
...
{code}

So, the types are not Spark SQL keywords, which causes some inconvenience when 
analysing or transforming the lexer tree, for example when forming stable 
column aliases.

We need to define the Spark SQL types as keywords in SqlBaseLexer.g4.

Also, typed literals have the same issue. The types "DATE", "TIMESTAMP_NTZ", 
"TIMESTAMP", "TIMESTAMP_LTZ", "INTERVAL", and "X" should be defined as base 
lexer tokens. 
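
For illustration, the typed literals in question (literal values are 
arbitrary); the leading type names are what this issue proposes to define as 
base lexer tokens:

{code:python}
spark.sql("SELECT DATE '2023-03-20', TIMESTAMP '2023-03-20 12:00:00', X'1C'").show()
spark.sql("SELECT TIMESTAMP_NTZ '2023-03-20 12:00:00', INTERVAL '1' DAY").show()
{code}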




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42791) Create golden file test framework for analysis

2023-03-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702811#comment-17702811
 ] 

Apache Spark commented on SPARK-42791:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40492

> Create golden file test framework for analysis
> --
>
> Key: SPARK-42791
> URL: https://issues.apache.org/jira/browse/SPARK-42791
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 3.5.0
>
>
> Here we track the work to add new golden file test support for the Spark 
> analyzer. Each golden file can contain a list of SQL queries followed by the 
> string representations of their analyzed logical plans.
>  
> This can be similar to Spark's existing `SQLQueryTestSuite` [1], but stopping 
> after analysis and listing analyzed plans as the results instead of fully 
> executing queries end-to-end. As another example, ZetaSQL has analyzer-based 
> golden file testing like this as well [2].
>  
> This way, any changes to analysis will show up as test diffs, which are easy 
> to spot in review and also easy to update automatically. This could help the 
> community maintain the quality of Apache Spark's query analysis together.
>  
> [1] 
> [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala]
>  
> [2] 
> [https://github.com/google/zetasql/blob/master/zetasql/analyzer/testdata/limit.test].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42872) Spark SQL reads unnecessary nested fields

2023-03-20 Thread Jiri Humpolicek (Jira)
Jiri Humpolicek created SPARK-42872:
---

 Summary: Spark SQL reads unnecessary nested fields
 Key: SPARK-42872
 URL: https://issues.apache.org/jira/browse/SPARK-42872
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.2
Reporter: Jiri Humpolicek


When we use higher-order functions in a Spark SQL query, it would be great if 
it were possible to write the following example in a way that Spark reads only 
the necessary nested fields.

Example:
1) Loading data
{code:scala}
val jsonStr = """{
 "items": [
   {"itemId": 1, "itemData": "a"},
   {"itemId": 2, "itemData": "b"}
 ]
}"""
val df = spark.read.json(Seq(jsonStr).toDS)
df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
{code}
2) read query with explain
{code:scala}
val read = spark.table("persisted")
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)

read.select(transform($"items", 
i=>i.getItem("itemId")).as('itemIds)).explain(true)
// ReadSchema: struct<items:array<struct<itemData:string,itemId:bigint>>>
{code}
We use only the *itemId* field from the structure in the array, but the read 
schema contains all fields of the structure.
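
A possible workaround sketch (an assumption, not from the issue itself): 
extracting the field directly, instead of inside {{transform}}, lets nested 
schema pruning apply. In PySpark:

{code:python}
from pyspark.sql import functions as F

spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", True)
read = spark.table("persisted")

# Field extraction over an array of structs prunes the read schema:
read.select(F.col("items.itemId").alias("itemIds")).explain(True)
# expected ReadSchema: struct<items:array<struct<itemId:bigint>>>
{code}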



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42790) Abstract the excluded method for better test for JDBC docker tests.

2023-03-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-42790.
--
Fix Version/s: 3.5.0
 Assignee: jiaan.geng
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/40418

> Abstract the excluded method for better test for JDBC docker tests.
> ---
>
> Key: SPARK-42790
> URL: https://issues.apache.org/jira/browse/SPARK-42790
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41006) ConfigMap has the same name when launching two pods on the same namespace

2023-03-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41006:


Assignee: Apache Spark

> ConfigMap has the same name when launching two pods on the same namespace
> -
>
> Key: SPARK-41006
> URL: https://issues.apache.org/jira/browse/SPARK-41006
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.0, 3.2.0, 3.3.0
>Reporter: Eric
>Assignee: Apache Spark
>Priority: Minor
>
> If we use the Spark Launcher to launch our spark apps in k8s:
> {code:java}
> val sparkLauncher = new InProcessLauncher()
>  .setMaster(k8sMaster)
>  .setDeployMode(deployMode)
>  .setAppName(appName)
>  .setVerbose(true)
> sparkLauncher.startApplication(new SparkAppHandle.Listener { ...{code}
> We have an issue when we launch another Spark driver in the same namespace 
> where another Spark app was running:
> {code:java}
> kp -n audit-exporter-eee5073aac -w
> NAME                                     READY   STATUS        RESTARTS   AGE
> audit-exporter-71489e843d8085c0-driver   1/1     Running       0          
> 9m54s
> audit-exporter-7e6b8b843d80b9e6-exec-1   1/1     Running       0          
> 9m40s
> data-io-120204843d899567-driver          0/1     Terminating   0          1s
> data-io-120204843d899567-driver          0/1     Terminating   0          2s
> data-io-120204843d899567-driver          0/1     Terminating   0          3s
> data-io-120204843d899567-driver          0/1     Terminating   0          
> 3s{code}
> The error is:
> {code:java}
> {"time":"2022-11-03T12:49:45.626Z","lvl":"WARN","logger":"o.a.s.l.InProcessAppHandle","thread":"spark-app-38:
>  'data-io'","msg":"Application failed with 
> exception.","stack_trace":"io.fabric8.kubernetes.client.KubernetesClientException:
>  Failure executing: PUT at: 
> https://kubernetes.default/api/v1/namespaces/audit-exporter-eee5073aac/configmaps/spark-drv-d19c37843d80350c-conf-map.
>  Message: ConfigMap \"spark-drv-d19c37843d80350c-conf-map\" is invalid: data: 
> Forbidden: field is immutable when `immutable` is set. Received status: 
> Status(apiVersion=v1, code=422, 
> details=StatusDetails(causes=[StatusCause(field=data, message=Forbidden: 
> field is immutable when `immutable` is set, reason=FieldValueForbidden, 
> additionalProperties={})], group=null, kind=ConfigMap, 
> name=spark-drv-d19c37843d80350c-conf-map, retryAfterSeconds=null, uid=null, 
> additionalProperties={}), kind=Status, message=ConfigMap 
> \"spark-drv-d19c37843d80350c-conf-map\" is invalid: data: Forbidden: field is 
> immutable when `immutable` is set, metadata=ListMeta(_continue=null, 
> remainingItemCount=null, resourceVersion=null, selfLink=null, 
> additionalProperties={}), reason=Invalid, status=Failure, 
> additionalProperties={}).\n\tat 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:342)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:322)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:649)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:195)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation$$Lambda$5360/00.apply(Unknown
>  Source)\n\tat 
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:200)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:141)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.BaseOperation$$Lambda$4618/00.apply(Unknown
>  Source)\n\tat 
> io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.replace(CreateOrReplaceHelper.java:69)\n\tat
>  
> io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.createOrReplace(CreateOrReplaceHelper.java:61)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:318)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:83)\n\tat
>  
> 

[jira] [Commented] (SPARK-41006) ConfigMap has the same name when launching two pods on the same namespace

2023-03-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702710#comment-17702710
 ] 

Apache Spark commented on SPARK-41006:
--

User 'DHKold' has created a pull request for this issue:
https://github.com/apache/spark/pull/40491

> ConfigMap has the same name when launching two pods on the same namespace
> -
>
> Key: SPARK-41006
> URL: https://issues.apache.org/jira/browse/SPARK-41006
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.0, 3.2.0, 3.3.0
>Reporter: Eric
>Priority: Minor
>
> If we use the Spark Launcher to launch our spark apps in k8s:
> {code:java}
> val sparkLauncher = new InProcessLauncher()
>  .setMaster(k8sMaster)
>  .setDeployMode(deployMode)
>  .setAppName(appName)
>  .setVerbose(true)
> sparkLauncher.startApplication(new SparkAppHandle.Listener { ...{code}
> We have an issue when we launch another Spark driver in the same namespace 
> where another Spark app was running:
> {code:java}
> kp -n audit-exporter-eee5073aac -w
> NAME                                     READY   STATUS        RESTARTS   AGE
> audit-exporter-71489e843d8085c0-driver   1/1     Running       0          
> 9m54s
> audit-exporter-7e6b8b843d80b9e6-exec-1   1/1     Running       0          
> 9m40s
> data-io-120204843d899567-driver          0/1     Terminating   0          1s
> data-io-120204843d899567-driver          0/1     Terminating   0          2s
> data-io-120204843d899567-driver          0/1     Terminating   0          3s
> data-io-120204843d899567-driver          0/1     Terminating   0          
> 3s{code}
> The error is:
> {code:java}
> {"time":"2022-11-03T12:49:45.626Z","lvl":"WARN","logger":"o.a.s.l.InProcessAppHandle","thread":"spark-app-38:
>  'data-io'","msg":"Application failed with 
> exception.","stack_trace":"io.fabric8.kubernetes.client.KubernetesClientException:
>  Failure executing: PUT at: 
> https://kubernetes.default/api/v1/namespaces/audit-exporter-eee5073aac/configmaps/spark-drv-d19c37843d80350c-conf-map.
>  Message: ConfigMap \"spark-drv-d19c37843d80350c-conf-map\" is invalid: data: 
> Forbidden: field is immutable when `immutable` is set. Received status: 
> Status(apiVersion=v1, code=422, 
> details=StatusDetails(causes=[StatusCause(field=data, message=Forbidden: 
> field is immutable when `immutable` is set, reason=FieldValueForbidden, 
> additionalProperties={})], group=null, kind=ConfigMap, 
> name=spark-drv-d19c37843d80350c-conf-map, retryAfterSeconds=null, uid=null, 
> additionalProperties={}), kind=Status, message=ConfigMap 
> \"spark-drv-d19c37843d80350c-conf-map\" is invalid: data: Forbidden: field is 
> immutable when `immutable` is set, metadata=ListMeta(_continue=null, 
> remainingItemCount=null, resourceVersion=null, selfLink=null, 
> additionalProperties={}), reason=Invalid, status=Failure, 
> additionalProperties={}).\n\tat 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:342)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:322)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:649)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:195)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation$$Lambda$5360/00.apply(Unknown
>  Source)\n\tat 
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:200)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:141)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.BaseOperation$$Lambda$4618/00.apply(Unknown
>  Source)\n\tat 
> io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.replace(CreateOrReplaceHelper.java:69)\n\tat
>  
> io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.createOrReplace(CreateOrReplaceHelper.java:61)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:318)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:83)\n\tat
>  
> 

[jira] [Assigned] (SPARK-41006) ConfigMap has the same name when launching two pods on the same namespace

2023-03-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41006:


Assignee: (was: Apache Spark)

> ConfigMap has the same name when launching two pods on the same namespace
> -
>
> Key: SPARK-41006
> URL: https://issues.apache.org/jira/browse/SPARK-41006
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.0, 3.2.0, 3.3.0
>Reporter: Eric
>Priority: Minor
>
> If we use the Spark Launcher to launch our spark apps in k8s:
> {code:java}
> val sparkLauncher = new InProcessLauncher()
>  .setMaster(k8sMaster)
>  .setDeployMode(deployMode)
>  .setAppName(appName)
>  .setVerbose(true)
> sparkLauncher.startApplication(new SparkAppHandle.Listener { ...{code}
> We have an issue when we launch another Spark driver in the same namespace 
> where another Spark app was running:
> {code:java}
> kp -n audit-exporter-eee5073aac -w
> NAME                                     READY   STATUS        RESTARTS   AGE
> audit-exporter-71489e843d8085c0-driver   1/1     Running       0          
> 9m54s
> audit-exporter-7e6b8b843d80b9e6-exec-1   1/1     Running       0          
> 9m40s
> data-io-120204843d899567-driver          0/1     Terminating   0          1s
> data-io-120204843d899567-driver          0/1     Terminating   0          2s
> data-io-120204843d899567-driver          0/1     Terminating   0          3s
> data-io-120204843d899567-driver          0/1     Terminating   0          
> 3s{code}
> The error is:
> {code:java}
> {"time":"2022-11-03T12:49:45.626Z","lvl":"WARN","logger":"o.a.s.l.InProcessAppHandle","thread":"spark-app-38:
>  'data-io'","msg":"Application failed with 
> exception.","stack_trace":"io.fabric8.kubernetes.client.KubernetesClientException:
>  Failure executing: PUT at: 
> https://kubernetes.default/api/v1/namespaces/audit-exporter-eee5073aac/configmaps/spark-drv-d19c37843d80350c-conf-map.
>  Message: ConfigMap \"spark-drv-d19c37843d80350c-conf-map\" is invalid: data: 
> Forbidden: field is immutable when `immutable` is set. Received status: 
> Status(apiVersion=v1, code=422, 
> details=StatusDetails(causes=[StatusCause(field=data, message=Forbidden: 
> field is immutable when `immutable` is set, reason=FieldValueForbidden, 
> additionalProperties={})], group=null, kind=ConfigMap, 
> name=spark-drv-d19c37843d80350c-conf-map, retryAfterSeconds=null, uid=null, 
> additionalProperties={}), kind=Status, message=ConfigMap 
> \"spark-drv-d19c37843d80350c-conf-map\" is invalid: data: Forbidden: field is 
> immutable when `immutable` is set, metadata=ListMeta(_continue=null, 
> remainingItemCount=null, resourceVersion=null, selfLink=null, 
> additionalProperties={}), reason=Invalid, status=Failure, 
> additionalProperties={}).\n\tat 
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:682)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:661)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:612)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:555)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:518)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:342)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleUpdate(OperationSupport.java:322)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleUpdate(BaseOperation.java:649)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:195)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation$$Lambda$5360/00.apply(Unknown
>  Source)\n\tat 
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:200)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:141)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.BaseOperation$$Lambda$4618/00.apply(Unknown
>  Source)\n\tat 
> io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.replace(CreateOrReplaceHelper.java:69)\n\tat
>  
> io.fabric8.kubernetes.client.utils.CreateOrReplaceHelper.createOrReplace(CreateOrReplaceHelper.java:61)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:318)\n\tat
>  
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.createOrReplace(BaseOperation.java:83)\n\tat
>  
> 

[jira] [Assigned] (SPARK-42536) Upgrade log4j2 to 2.20.0

2023-03-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42536:


Assignee: (was: Apache Spark)

> Upgrade log4j2 to 2.20.0
> 
>
> Key: SPARK-42536
> URL: https://issues.apache.org/jira/browse/SPARK-42536
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>
> [https://logging.apache.org/log4j/2.x/release-notes/2.20.0.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42536) Upgrade log4j2 to 2.20.0

2023-03-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42536:


Assignee: Apache Spark

> Upgrade log4j2 to 2.20.0
> 
>
> Key: SPARK-42536
> URL: https://issues.apache.org/jira/browse/SPARK-42536
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> [https://logging.apache.org/log4j/2.x/release-notes/2.20.0.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42536) Upgrade log4j2 to 2.20.0

2023-03-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702667#comment-17702667
 ] 

Apache Spark commented on SPARK-42536:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40490

> Upgrade log4j2 to 2.20.0
> 
>
> Key: SPARK-42536
> URL: https://issues.apache.org/jira/browse/SPARK-42536
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>
> [https://logging.apache.org/log4j/2.x/release-notes/2.20.0.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42871) Upgrade slf4j to 2.0.7

2023-03-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702662#comment-17702662
 ] 

Apache Spark commented on SPARK-42871:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40489

> Upgrade slf4j to 2.0.7
> --
>
> Key: SPARK-42871
> URL: https://issues.apache.org/jira/browse/SPARK-42871
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> https://www.slf4j.org/news.html#2.0.7



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42536) Upgrade log4j2 to 2.20.0

2023-03-20 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-42536:
-
Description: 
[https://logging.apache.org/log4j/2.x/release-notes/2.20.0.html]  (was: Need 
wait upgrade slf4j 2.0.7 first
 * [https://logging.apache.org/log4j/2.x/release-notes/2.20.0.html]
 * https://jira.qos.ch/browse/SLF4J-511)

> Upgrade log4j2 to 2.20.0
> 
>
> Key: SPARK-42536
> URL: https://issues.apache.org/jira/browse/SPARK-42536
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Minor
>
> [https://logging.apache.org/log4j/2.x/release-notes/2.20.0.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42871) Upgrade slf4j to 2.0.7

2023-03-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42871:


Assignee: (was: Apache Spark)

> Upgrade slf4j to 2.0.7
> --
>
> Key: SPARK-42871
> URL: https://issues.apache.org/jira/browse/SPARK-42871
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> https://www.slf4j.org/news.html#2.0.7



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42871) Upgrade slf4j to 2.0.7

2023-03-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42871:


Assignee: Apache Spark

> Upgrade slf4j to 2.0.7
> --
>
> Key: SPARK-42871
> URL: https://issues.apache.org/jira/browse/SPARK-42871
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>
> https://www.slf4j.org/news.html#2.0.7



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42871) Upgrade slf4j to 2.0.7

2023-03-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702661#comment-17702661
 ] 

Apache Spark commented on SPARK-42871:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40489

> Upgrade slf4j to 2.0.7
> --
>
> Key: SPARK-42871
> URL: https://issues.apache.org/jira/browse/SPARK-42871
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> https://www.slf4j.org/news.html#2.0.7



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42871) Upgrade slf4j to 2.0.7

2023-03-20 Thread Yang Jie (Jira)
Yang Jie created SPARK-42871:


 Summary: Upgrade slf4j to 2.0.7
 Key: SPARK-42871
 URL: https://issues.apache.org/jira/browse/SPARK-42871
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: Yang Jie


https://www.slf4j.org/news.html#2.0.7



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42851) EquivalentExpressions methods need to be consistently guarded by supportedExpression

2023-03-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702658#comment-17702658
 ] 

Apache Spark commented on SPARK-42851:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/40488

> EquivalentExpressions methods need to be consistently guarded by 
> supportedExpression
> 
>
> Key: SPARK-42851
> URL: https://issues.apache.org/jira/browse/SPARK-42851
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Kris Mok
>Priority: Major
>
> SPARK-41468 tried to fix a bug but introduced a new regression. Its change to 
> {{EquivalentExpressions}} added a {{supportedExpression()}} guard to the 
> {{addExprTree()}} and {{getExprState()}} methods, but didn't add the same 
> guard to the other "add" entry point -- {{addExpr()}}.
> As such, code paths that add single expressions to CSE via {{addExpr()}} may 
> succeed, but upon retrieval via {{getExprState()}} they'd inconsistently get a 
> {{None}} due to failing the guard.
> We need to make sure the "add" and "get" methods are consistent. It could be 
> done by one of:
> 1. Adding the same {{supportedExpression()}} guard to {{addExpr()}}, or
> 2. Removing the guard from {{getExprState()}}, relying solely on the guard on 
> the "add" path to make sure only intended state is added.
> (or other alternative refactorings to fuse the guard into various methods to 
> make it more efficient)
> There are pros and cons to the two directions above: because {{addExpr()}} 
> used to allow more (potentially incorrect) expressions to get CSE'd, making 
> it more restrictive may cause performance regressions (for the cases that 
> happened to work).
> Example:
> {code:sql}
> select max(transform(array(id), x -> x)), max(transform(array(id), x -> x)) 
> from range(2)
> {code}
> Running this query on Spark 3.2 branch returns the correct value:
> {code}
> scala> spark.sql("select max(transform(array(id), x -> x)), 
> max(transform(array(id), x -> x)) from range(2)").collect
> res0: Array[org.apache.spark.sql.Row] = 
> Array([WrappedArray(1),WrappedArray(1)])
> {code}
> Here, {{transform(array(id), x -> x)}} is an {{AggregateExpression}} that was 
> (potentially unsafely) recognized by {{addExpr()}} as a common subexpression, 
> and {{getExprState()}} doesn't do extra guarding, so during physical 
> planning, in {{PhysicalAggregation}} this expression gets CSE'd in both the 
> aggregation expression list and the result expressions list.
> {code}
> AdaptiveSparkPlan isFinalPlan=false
> +- SortAggregate(key=[], functions=[max(transform(array(id#0L), 
> lambdafunction(lambda x#1L, lambda x#1L, false)))])
>+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=11]
>   +- SortAggregate(key=[], functions=[partial_max(transform(array(id#0L), 
> lambdafunction(lambda x#1L, lambda x#1L, false)))])
>  +- Range (0, 2, step=1, splits=16)
> {code}
> Running the same query on current master triggers an error when binding the 
> result expression to the aggregate expression in the Aggregate operators (for 
> a WSCG-enabled operator like {{HashAggregateExec}}, the same error would show 
> up during codegen):
> {code}
> ERROR TaskSetManager: Task 0 in stage 2.0 failed 1 times; aborting job
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 
> (TID 16) (ip-10-110-16-93.us-west-2.compute.internal executor driver): 
> java.lang.IllegalStateException: Couldn't find max(transform(array(id#0L), 
> lambdafunction(lambda x#2L, lambda x#2L, false)))#4 in 
> [max(transform(array(id#0L), lambdafunction(lambda x#1L, lambda x#1L, 
> false)))#3]
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:517)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1249)
>   at 
> org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1248)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:532)
>   at 
> 
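
A minimal, self-contained Scala model of the add/get inconsistency described above. This is an illustration only, not Spark's actual {{EquivalentExpressions}} code: expressions are plain strings, and the {{supported}} predicate stands in for {{supportedExpression()}}.

{code:scala}
import scala.collection.mutable

// Toy stand-in for EquivalentExpressions: expressions are plain strings,
// `supported` plays the role of supportedExpression().
final case class ExprState(expr: String, var useCount: Int)

class EquivalentExprs(guardAdd: Boolean, guardGet: Boolean) {
  private val states = mutable.Map.empty[String, ExprState]

  private def supported(e: String): Boolean = !e.contains("lambda")

  // Mirrors addExpr(): returns true iff the expression was already present.
  def addExpr(e: String): Boolean = {
    if (guardAdd && !supported(e)) return false
    states.get(e) match {
      case Some(s) => s.useCount += 1; true
      case None    => states(e) = ExprState(e, 1); false
    }
  }

  // Mirrors getExprState().
  def getExprState(e: String): Option[ExprState] =
    if (guardGet && !supported(e)) None else states.get(e)
}

object GuardDemo extends App {
  val expr = "max(transform(array(id), lambda x -> x))"

  // Status quo: "add" unguarded, "get" guarded. A successful add can still
  // yield None on retrieval, the inconsistency hit during binding.
  val inconsistent = new EquivalentExprs(guardAdd = false, guardGet = true)
  inconsistent.addExpr(expr)
  println(inconsistent.addExpr(expr))      // true: treated as a common subexpression
  println(inconsistent.getExprState(expr)) // None: guard rejects it on the way out

  // Fix direction 1: guard both entry points identically.
  val consistent = new EquivalentExprs(guardAdd = true, guardGet = true)
  println(consistent.addExpr(expr))        // false: never admitted
  println(consistent.getExprState(expr))   // None, consistently
}
{code}

Either fix direction restores the invariant that a successful add implies a successful get, which is what the binding step during physical planning relies on.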


[jira] [Resolved] (SPARK-42720) Refactor the withSequenceColumn

2023-03-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42720.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40456
[https://github.com/apache/spark/pull/40456]

> Refactor the withSequenceColumn
> ---
>
> Key: SPARK-42720
> URL: https://issues.apache.org/jira/browse/SPARK-42720
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42720) Refactor the withSequenceColumn

2023-03-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42720:


Assignee: Hyukjin Kwon

> Refactor the withSequenceColumn
> ---
>
> Key: SPARK-42720
> URL: https://issues.apache.org/jira/browse/SPARK-42720
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Assignee: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42791) Create golden file test framework for analysis

2023-03-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-42791:
---

Assignee: Daniel

> Create golden file test framework for analysis
> --
>
> Key: SPARK-42791
> URL: https://issues.apache.org/jira/browse/SPARK-42791
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 3.5.0
>
>
> Here we track the work to add new golden file test support for the Spark 
> analyzer. Each golden file can contain a list of SQL queries followed by the 
> string representations of their analyzed logical plans.
>  
> This can be similar to Spark's existing `SQLQueryTestSuite` [1], but stopping 
> after analysis and listing analyzed plans as the results instead of fully 
> executing queries end-to-end. As another example, ZetaSQL has analyzer-based 
> golden file testing like this as well [2].
>  
> This way, any changes to analysis will show up as test diffs, which are easy 
> to spot in review and easy to update automatically. This could help the 
> community maintain the quality of Apache Spark's query analysis.
>  
> [1] 
> [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala]
>  
> [2] 
> [https://github.com/google/zetasql/blob/master/zetasql/analyzer/testdata/limit.test].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
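
As a reference point, here is a short sketch (assuming a local {{SparkSession}}; this is not the proposed framework itself) of how analysis-only output can be obtained through public APIs. The analyzed plan string printed below is the kind of text such a golden file would record for each query.

{code:scala}
import org.apache.spark.sql.SparkSession

object AnalyzedPlanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("analysis-golden-sketch")
      .getOrCreate()

    val df = spark.sql("SELECT id, id + 1 AS next FROM range(2)")

    // Stop after analysis: print the analyzed logical plan rather than
    // optimizing or executing the query end-to-end.
    println(df.queryExecution.analyzed.treeString)

    spark.stop()
  }
}
{code}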



[jira] [Resolved] (SPARK-42791) Create golden file test framework for analysis

2023-03-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-42791.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40449
[https://github.com/apache/spark/pull/40449]

> Create golden file test framework for analysis
> --
>
> Key: SPARK-42791
> URL: https://issues.apache.org/jira/browse/SPARK-42791
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
> Fix For: 3.5.0
>
>
> Here we track the work to add new golden file test support for the Spark 
> analyzer. Each golden file can contain a list of SQL queries followed by the 
> string representations of their analyzed logical plans.
>  
> This can be similar to Spark's existing `SQLQueryTestSuite` [1], but stopping 
> after analysis and listing analyzed plans as the results instead of fully 
> executing queries end-to-end. As another example, ZetaSQL has analyzer-based 
> golden file testing like this as well [2].
>  
> This way, any changes to analysis will show up as test diffs, which are easy 
> to spot in review and easy to update automatically. This could help the 
> community maintain the quality of Apache Spark's query analysis.
>  
> [1] 
> [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala]
>  
> [2] 
> [https://github.com/google/zetasql/blob/master/zetasql/analyzer/testdata/limit.test].
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42340) Implement GroupedData.applyInPandas

2023-03-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702632#comment-17702632
 ] 

Apache Spark commented on SPARK-42340:
--

User 'xinrong-meng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40486

> Implement GroupedData.applyInPandas
> ---
>
> Key: SPARK-42340
> URL: https://issues.apache.org/jira/browse/SPARK-42340
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42340) Implement GroupedData.applyInPandas

2023-03-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42340.
--
  Assignee: Xinrong Meng
Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/40405

> Implement GroupedData.applyInPandas
> ---
>
> Key: SPARK-42340
> URL: https://issues.apache.org/jira/browse/SPARK-42340
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42340) Implement GroupedData.applyInPandas

2023-03-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-42340:
-
Fix Version/s: 3.5.0

> Implement GroupedData.applyInPandas
> ---
>
> Key: SPARK-42340
> URL: https://issues.apache.org/jira/browse/SPARK-42340
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Priority: Major
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42870) Move `toCatalystValue` to connect-common

2023-03-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42870:


Assignee: (was: Apache Spark)

> Move `toCatalystValue` to connect-common
> 
>
> Key: SPARK-42870
> URL: https://issues.apache.org/jira/browse/SPARK-42870
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42870) Move `toCatalystValue` to connect-common

2023-03-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702598#comment-17702598
 ] 

Apache Spark commented on SPARK-42870:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40485

> Move `toCatalystValue` to connect-common
> 
>
> Key: SPARK-42870
> URL: https://issues.apache.org/jira/browse/SPARK-42870
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42870) Move `toCatalystValue` to connect-common

2023-03-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42870:


Assignee: Apache Spark

> Move `toCatalystValue` to connect-common
> 
>
> Key: SPARK-42870
> URL: https://issues.apache.org/jira/browse/SPARK-42870
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42870) Move `toCatalystValue` to connect-common

2023-03-20 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-42870:
-

 Summary: Move `toCatalystValue` to connect-common
 Key: SPARK-42870
 URL: https://issues.apache.org/jira/browse/SPARK-42870
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42869) Cannot analyze window expression on subquery

2023-03-20 Thread GuangWeiHong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

GuangWeiHong updated SPARK-42869:
-
Description: 
 

CREATE TABLE test_noindex_table(`name` STRING,`age` INT,`city` STRING) 
PARTITIONED BY (`date` STRING);

 

SELECT
    *
FROM
(
    SELECT *, COUNT(1) OVER itr AS grp_size
    FROM test_noindex_table 
    WINDOW itr AS (PARTITION BY city)
) tbl
WINDOW itr2 AS (PARTITION BY
    city
)
 
Window specification itr is not defined in the WINDOW clause.
  !image-2023-03-20-18-00-40-578.png|width=560,height=361!

  was:
 
SELECT * FROM ( SELECT *, COUNT(1) OVER itr AS grp_size FROM test WINDOW itr AS 
(PARTITION BY model) ) tbl WINDOW itr2 AS (PARTITION BY model )
 
Window specification itr is not defined in the WINDOW clause.
  !image-2023-03-20-18-00-40-578.png|width=560,height=361!


> Cannot analyze window expression on subquery
> ---
>
> Key: SPARK-42869
> URL: https://issues.apache.org/jira/browse/SPARK-42869
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: GuangWeiHong
>Priority: Major
> Attachments: image-2023-03-20-18-00-40-578.png
>
>
>  
> CREATE TABLE test_noindex_table(`name` STRING,`age` INT,`city` STRING) 
> PARTITIONED BY (`date` STRING);
>  
> SELECT
>     *
> FROM
> (
>     SELECT *, COUNT(1) OVER itr AS grp_size
>     FROM test_noindex_table 
>     WINDOW itr AS (PARTITION BY city)
> ) tbl
> WINDOW itr2 AS (PARTITION BY
>     city
> )
>  
> Window specification itr is not defined in the WINDOW clause.
>   !image-2023-03-20-18-00-40-578.png|width=560,height=361!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
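
An untested workaround sketch for the report above: writing the window specification inline in the subquery, instead of referencing a named WINDOW clause, sidesteps the failing resolution. It assumes the {{test_noindex_table}} from the repro already exists.

{code:scala}
import org.apache.spark.sql.SparkSession

object InlineWindowWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("inline-window-workaround")
      .getOrCreate()

    // Same query shape as the repro, but the window spec is written inline
    // in the subquery instead of being referenced through a named WINDOW clause.
    spark.sql(
      """SELECT *
        |FROM (
        |  SELECT *, COUNT(1) OVER (PARTITION BY city) AS grp_size
        |  FROM test_noindex_table
        |) tbl
        |""".stripMargin).show()

    spark.stop()
  }
}
{code}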



[jira] [Updated] (SPARK-42869) Cannot analyze window expression on subquery

2023-03-20 Thread GuangWeiHong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

GuangWeiHong updated SPARK-42869:
-
Description: 
 
SELECT * FROM ( SELECT *, COUNT(1) OVER itr AS grp_size FROM test WINDOW itr AS 
(PARTITION BY model) ) tbl WINDOW itr2 AS (PARTITION BY model )
 
Window specification itr is not defined in the WINDOW clause.
  !image-2023-03-20-18-00-40-578.png|width=560,height=361!

  was:
 
SELECT * FROM ( SELECT *, COUNT(1) OVER itr AS grp_size FROM test WINDOW itr AS 
(PARTITION BY model) ) tbl WINDOW itr2 AS (PARTITION BY model )
 
Window specification itr is not defined in the WINDOW clause.
 


> Cannot analyze window expression on subquery
> ---
>
> Key: SPARK-42869
> URL: https://issues.apache.org/jira/browse/SPARK-42869
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: GuangWeiHong
>Priority: Major
> Attachments: image-2023-03-20-18-00-40-578.png
>
>
>  
> SELECT * FROM ( SELECT *, COUNT(1) OVER itr AS grp_size FROM test WINDOW itr 
> AS (PARTITION BY model) ) tbl WINDOW itr2 AS (PARTITION BY model )
>  
> Window specification itr is not defined in the WINDOW clause.
>   !image-2023-03-20-18-00-40-578.png|width=560,height=361!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42869) Cannot analyze window expression on subquery

2023-03-20 Thread GuangWeiHong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

GuangWeiHong updated SPARK-42869:
-
Attachment: image-2023-03-20-18-00-40-578.png

> Cannot analyze window expression on subquery
> ---
>
> Key: SPARK-42869
> URL: https://issues.apache.org/jira/browse/SPARK-42869
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: GuangWeiHong
>Priority: Major
> Attachments: image-2023-03-20-18-00-40-578.png
>
>
>  
> SELECT * FROM ( SELECT *, COUNT(1) OVER itr AS grp_size FROM test WINDOW itr 
> AS (PARTITION BY model) ) tbl WINDOW itr2 AS (PARTITION BY model )
>  
> Window specification itr is not defined in the WINDOW clause.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42869) Cannot analyze window expression on subquery

2023-03-20 Thread GuangWeiHong (Jira)
GuangWeiHong created SPARK-42869:


 Summary: Cannot analyze window expression on subquery
 Key: SPARK-42869
 URL: https://issues.apache.org/jira/browse/SPARK-42869
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
Reporter: GuangWeiHong


 
SELECT * FROM ( SELECT *, COUNT(1) OVER itr AS grp_size FROM test WINDOW itr AS 
(PARTITION BY model) ) tbl WINDOW itr2 AS (PARTITION BY model )
 
Window specification itr is not defined in the WINDOW clause.
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38973) When push-based shuffle is enabled, a stage may not complete when retried

2023-03-20 Thread Li Ying (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702556#comment-17702556
 ] 

Li Ying commented on SPARK-38973:
-

[~csingh] Should this bugfix be merged into 3.2.x branches?

> When push-based shuffle is enabled, a stage may not complete when retried
> -
>
> Key: SPARK-38973
> URL: https://issues.apache.org/jira/browse/SPARK-38973
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Fix For: 3.3.0
>
>
> With push-based shuffle enabled and adaptive merge finalization, there are 
> scenarios where a re-attempt of a ShuffleMapStage may not complete. 
> With adaptive merge finalization, a stage may be triggered for finalization 
> when it is in the following state:
>  # The stage is *not* running ({*}not{*} in the _running_ set of the 
> DAGScheduler) - it had failed, was canceled, or is waiting, and
>  # The stage has no pending partitions (all of its tasks completed at least 
> once).
> When finalization completes for such a stage, the stage will still not be 
> marked as {_}mergeFinalized{_}. 
> The state of the stage will be: 
>  * _stage.shuffleDependency.mergeFinalized = false_
>  * _stage.shuffleDependency.getFinalizeTask = finalizeTask_
>  * Merged statuses of the stage are unregistered
>  
> When the stage is resubmitted, the newer attempt of the stage will never 
> complete even though its tasks may be completed. This is because the newer 
> attempt will have {_}shuffleMergeEnabled = true{_} (since the stage was never 
> marked as {_}mergeFinalized{_} in the previous attempt), and the 
> _finalizeTask_ is still present (left over from the finalization of the 
> previous stage attempt).
>  
> So, when all the tasks of the newer attempt complete, these conditions 
> will be true:
>  * stage will be running
>  * There will be no pending partitions since all the tasks completed
>  * _stage.shuffleDependency.shuffleMergeEnabled = true_
>  * _stage.shuffleDependency.shuffleMergeFinalized = false_
>  * _stage.shuffleDependency.getFinalizeTask_ is not empty
> This leads the DAGScheduler to try to schedule finalization instead of 
> triggering completion of the stage. However, because of the last condition it 
> never actually schedules the finalization, so the stage never completes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
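
A toy Scala model of the hang described above. The case class and decision function are simplified illustrations, not Spark's actual DAGScheduler logic: they only encode the listed conditions to show why finalization is never scheduled for the retried attempt.

{code:scala}
// Toy model of the scheduling decision; names and logic are simplified
// for illustration and are not Spark's actual implementation.
final case class ShuffleDep(
    shuffleMergeEnabled: Boolean,
    shuffleMergeFinalized: Boolean,
    finalizeTask: Option[String])

object StageHangDemo extends App {
  // Invoked when the running stage has no pending partitions left.
  def onAllTasksCompleted(dep: ShuffleDep): String =
    if (dep.shuffleMergeEnabled && !dep.shuffleMergeFinalized) {
      if (dep.finalizeTask.isEmpty)
        "schedule finalization, then mark stage complete"
      else
        "finalize task already present: nothing scheduled, stage never completes"
    } else {
      "mark stage complete"
    }

  // The retried attempt inherits state from the failed/canceled first attempt:
  // mergeFinalized was never set, and the old finalize task is still around.
  val retriedAttempt = ShuffleDep(
    shuffleMergeEnabled = true,
    shuffleMergeFinalized = false,
    finalizeTask = Some("finalize-from-attempt-0"))

  println(onAllTasksCompleted(retriedAttempt))
}
{code}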



[jira] [Commented] (SPARK-42868) Support eliminate sorts in AQE Optimizer

2023-03-20 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702545#comment-17702545
 ] 

Apache Spark commented on SPARK-42868:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/40484

> Support eliminate sorts in AQE Optimizer
> 
>
> Key: SPARK-42868
> URL: https://issues.apache.org/jira/browse/SPARK-42868
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42868) Support eliminate sorts in AQE Optimizer

2023-03-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42868:


Assignee: Apache Spark

> Support eliminate sorts in AQE Optimizer
> 
>
> Key: SPARK-42868
> URL: https://issues.apache.org/jira/browse/SPARK-42868
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42868) Support eliminate sorts in AQE Optimizer

2023-03-20 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42868:


Assignee: (was: Apache Spark)

> Support eliminate sorts in AQE Optimizer
> 
>
> Key: SPARK-42868
> URL: https://issues.apache.org/jira/browse/SPARK-42868
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42852) Revert NamedLambdaVariable related changes from EquivalentExpressions

2023-03-20 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-42852:
---
Affects Version/s: (was: 3.3.2)

> Revert NamedLambdaVariable related changes from EquivalentExpressions
> -
>
> Key: SPARK-42852
> URL: https://issues.apache.org/jira/browse/SPARK-42852
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
> Fix For: 3.4.0
>
>
> See discussion 
> https://github.com/apache/spark/pull/40473#issuecomment-1474848224



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42827) Support `functions#array_prepend`

2023-03-20 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-42827.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40481
[https://github.com/apache/spark/pull/40481]

> Support `functions#array_prepend`
> -
>
> Key: SPARK-42827
> URL: https://issues.apache.org/jira/browse/SPARK-42827
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.5.0
>
>
> Wait for SPARK-41233



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
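
For context, a small usage sketch of the Scala-side function this ticket mirrors in Connect. It assumes Spark 3.5, where {{functions#array_prepend}} is available; the DataFrame and column names are made up for illustration.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_prepend, lit}

object ArrayPrependDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("array-prepend-demo")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, Seq(2, 3)), (9, Seq.empty[Int])).toDF("head", "tail")

    // array_prepend(column, element) returns a new array with `element`
    // placed before all existing elements of the array column.
    df.select(array_prepend($"tail", lit(0)).as("prepended")).show(false)

    spark.stop()
  }
}
{code}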



[jira] [Assigned] (SPARK-42827) Support `functions#array_prepend`

2023-03-20 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-42827:
-

Assignee: Yang Jie

> Support `functions#array_prepend`
> -
>
> Key: SPARK-42827
> URL: https://issues.apache.org/jira/browse/SPARK-42827
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>
> Wait for SPARK-41233



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42868) Support eliminate sorts in AQE Optimizer

2023-03-20 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-42868:
---

 Summary: Support eliminate sorts in AQE Optimizer
 Key: SPARK-42868
 URL: https://issues.apache.org/jira/browse/SPARK-42868
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42864) Review and fix issues in MLlib API docs

2023-03-20 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng reassigned SPARK-42864:


Assignee: (was: Ruifeng Zheng)

> Review and fix issues in MLlib API docs
> ---
>
> Key: SPARK-42864
> URL: https://issues.apache.org/jira/browse/SPARK-42864
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


