[jira] [Updated] (SPARK-40161) Make Series.mode apply PandasMode
[ https://issues.apache.org/jira/browse/SPARK-40161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng updated SPARK-40161:
----------------------------------
    Parent: SPARK-40327
    Issue Type: Sub-task  (was: Improvement)

> Make Series.mode apply PandasMode
> ---------------------------------
>
> Key: SPARK-40161
> URL: https://issues.apache.org/jira/browse/SPARK-40161
> Project: Spark
> Issue Type: Sub-task
> Components: ps
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Minor
> Fix For: 3.4.0
[jira] [Updated] (SPARK-40138) Implement DataFrame.mode
[ https://issues.apache.org/jira/browse/SPARK-40138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng updated SPARK-40138:
----------------------------------
    Parent: SPARK-40327
    Issue Type: Sub-task  (was: Improvement)

> Implement DataFrame.mode
> ------------------------
>
> Key: SPARK-40138
> URL: https://issues.apache.org/jira/browse/SPARK-40138
> Project: Spark
> Issue Type: Sub-task
> Components: ps
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Fix For: 3.4.0
[jira] [Assigned] (SPARK-40333) Implement `GroupBy.nth`.
[ https://issues.apache.org/jira/browse/SPARK-40333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng reassigned SPARK-40333:
-------------------------------------
    Assignee: Ruifeng Zheng

> Implement `GroupBy.nth`.
> ------------------------
>
> Key: SPARK-40333
> URL: https://issues.apache.org/jira/browse/SPARK-40333
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Assignee: Ruifeng Zheng
> Priority: Major
>
> We should implement `GroupBy.nth` to increase pandas API coverage.
> pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.nth.html
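For context, a minimal pure-pandas sketch of the semantics this sub-task targets (illustration only, not the pandas-on-Spark implementation):

{code:python}
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 1, 2], "B": [None, 2.0, 3.0, 4.0, 5.0]})
g = df.groupby("A")

# nth(0) takes the first row of each group; unlike first(), it does not skip NaN.
print(g.nth(0))
# nth(1) takes the second row; groups that lack a second row are omitted.
print(g.nth(1))
# Negative indices count from the end of each group.
print(g.nth(-1))
{code}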
[jira] [Updated] (SPARK-40313) ps.DataFrame(data, index) should support the same anchor
[ https://issues.apache.org/jira/browse/SPARK-40313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng updated SPARK-40313:
----------------------------------
    Parent: SPARK-40327
    Issue Type: Sub-task  (was: Improvement)

> ps.DataFrame(data, index) should support the same anchor
> ---------------------------------------------------------
>
> Key: SPARK-40313
> URL: https://issues.apache.org/jira/browse/SPARK-40313
> Project: Spark
> Issue Type: Sub-task
> Components: ps
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Priority: Major
[jira] [Updated] (SPARK-40135) Support ps.Index in DataFrame creation
[ https://issues.apache.org/jira/browse/SPARK-40135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng updated SPARK-40135:
----------------------------------
    Parent: SPARK-40327
    Issue Type: Sub-task  (was: Improvement)

> Support ps.Index in DataFrame creation
> --------------------------------------
>
> Key: SPARK-40135
> URL: https://issues.apache.org/jira/browse/SPARK-40135
> Project: Spark
> Issue Type: Sub-task
> Components: ps
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Fix For: 3.4.0
[jira] [Commented] (SPARK-40327) Increase pandas API coverage for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600242#comment-17600242 ]

Ruifeng Zheng commented on SPARK-40327:
---------------------------------------
also cc [~dc-heros] [~dchvn]

> Increase pandas API coverage for pandas API on Spark
> ----------------------------------------------------
>
> Key: SPARK-40327
> URL: https://issues.apache.org/jira/browse/SPARK-40327
> Project: Spark
> Issue Type: Umbrella
> Components: Pandas API on Spark
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Priority: Major
>
> Increasing the pandas API coverage for Apache Spark 3.4.0.
[jira] [Assigned] (SPARK-40333) Implement `GroupBy.nth`.
[ https://issues.apache.org/jira/browse/SPARK-40333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40333:
------------------------------------
    Assignee: Apache Spark  (was: Ruifeng Zheng)
[jira] [Commented] (SPARK-40333) Implement `GroupBy.nth`.
[ https://issues.apache.org/jira/browse/SPARK-40333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600268#comment-17600268 ]

Apache Spark commented on SPARK-40333:
--------------------------------------
User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37801
[jira] [Assigned] (SPARK-40333) Implement `GroupBy.nth`.
[ https://issues.apache.org/jira/browse/SPARK-40333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40333:
------------------------------------
    Assignee: Ruifeng Zheng  (was: Apache Spark)
[jira] [Updated] (SPARK-40291) Improve the message for column not in group by clause error
[ https://issues.apache.org/jira/browse/SPARK-40291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Max Gekk updated SPARK-40291:
-----------------------------
    Parent: SPARK-37935
    Issue Type: Sub-task  (was: Task)

> Improve the message for column not in group by clause error
> ------------------------------------------------------------
>
> Key: SPARK-40291
> URL: https://issues.apache.org/jira/browse/SPARK-40291
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Linhong Liu
> Priority: Major
>
> Improve the message for the "column not in group by clause" error so that it uses the new error framework.
[jira] [Commented] (SPARK-39995) PySpark installation doesn't support Scala 2.13 binaries
[ https://issues.apache.org/jira/browse/SPARK-39995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600297#comment-17600297 ]

Oleksandr Shevchenko commented on SPARK-39995:
----------------------------------------------
It definitely matters. It affects which dependencies/packages we can use (e.g. DataSourceV2 API implementations for read and write), and it impacts DX (developer experience) and installation, including the CD process for our code.

> PySpark installation doesn't support Scala 2.13 binaries
> ---------------------------------------------------------
>
> Key: SPARK-39995
> URL: https://issues.apache.org/jira/browse/SPARK-39995
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: Oleksandr Shevchenko
> Priority: Major
>
> [PyPi|https://pypi.org/project/pyspark/] doesn't support Spark binary [installation|https://spark.apache.org/docs/latest/api/python/getting_started/install.html#using-pypi] for Scala 2.13.
> Currently, the setup [script|https://github.com/apache/spark/blob/master/python/pyspark/install.py] allows setting the Spark and Hadoop versions (PYSPARK_HADOOP_VERSION) and the mirror (PYSPARK_RELEASE_MIRROR) used to download the Spark binaries, but the download is always the Scala 2.12 compatible binaries. There is no parameter to download "spark-3.3.0-bin-hadoop3-scala2.13.tgz".
> It's possible to download Spark manually and set the needed SPARK_HOME, but that is hard to use with pip or Poetry.
> Also, env vars (e.g. PYSPARK_HADOOP_VERSION) are easy to use with pip and the CLI, but not possible with package managers like Poetry.
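For context, a minimal sketch of how the documented install knob is driven today; the PYSPARK_SCALA_VERSION knob in the comment is hypothetical and is exactly what this issue says is missing:

{code:python}
import os
import subprocess
import sys

# PYSPARK_HADOOP_VERSION is a documented knob for the PyPI install; "3" selects
# the Hadoop 3 build. A hypothetical PYSPARK_SCALA_VERSION=2.13 knob, which
# would select spark-3.3.0-bin-hadoop3-scala2.13.tgz, does not exist today.
env = dict(os.environ, PYSPARK_HADOOP_VERSION="3")
subprocess.check_call([sys.executable, "-m", "pip", "install", "pyspark"], env=env)
{code}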
[jira] [Created] (SPARK-40350) Improve the configuration of volcano scheduler
Sun BiaoBiao created SPARK-40350:
---------------------------------
    Summary: Improve the configuration of volcano scheduler
    Key: SPARK-40350
    URL: https://issues.apache.org/jira/browse/SPARK-40350
    Project: Spark
    Issue Type: Improvement
    Components: Kubernetes
    Affects Versions: 3.3.0
    Reporter: Sun BiaoBiao

Now that we use volcano as our scheduler, we need to specify the following configuration options:

{code:java}
spark.kubernetes.scheduler.name=volcano
spark.kubernetes.scheduler.volcano.podGroupTemplateFile=/path/to/podgroup-template.yaml
spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep
spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep
{code}

and we have to use a configMap to mount the /path/to/podgroup-template.yaml file.

If we could use spark config to specify the parameters of the podgroup, it would be much more convenient: we would not need a configMap to mount static files.
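A sketch of what the proposal could look like from the application side. The podGroup.* keys below are hypothetical (they do not exist in Spark today) and only illustrate the shape of the idea:

{code:python}
from pyspark import SparkConf

# Existing keys, as listed in the issue description:
conf = (
    SparkConf()
    .set("spark.kubernetes.scheduler.name", "volcano")
    .set("spark.kubernetes.driver.pod.featureSteps",
         "org.apache.spark.deploy.k8s.features.VolcanoFeatureStep")
    .set("spark.kubernetes.executor.pod.featureSteps",
         "org.apache.spark.deploy.k8s.features.VolcanoFeatureStep")
    # Hypothetical keys proposed here; today these values must live in the
    # podgroup-template.yaml mounted via a configMap:
    .set("spark.kubernetes.scheduler.volcano.podGroup.queue", "my-queue")
    .set("spark.kubernetes.scheduler.volcano.podGroup.minMember", "1")
)
{code}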
[jira] [Updated] (SPARK-40350) Improve the configuration of volcano scheduler
[ https://issues.apache.org/jira/browse/SPARK-40350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sun BiaoBiao updated SPARK-40350:
---------------------------------
    Description: the closing sentence now reads "If we use spark config to specify the parameters of the podgroup, it will be much more convenient, we don't need configmap to mount static files"
    was: "... it will be much more convenient, so that we don't need configmap to mount static files"
[jira] [Updated] (SPARK-40350) Improve the configuration of volcano scheduler
[ https://issues.apache.org/jira/browse/SPARK-40350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sun BiaoBiao updated SPARK-40350:
---------------------------------
    Description: the closing sentence now reads "... it will be much more convenient, so that we don't need configmap to mount static files"
    was: "... it will be much more convenient, so that we can mount static files without using configmap"
[jira] [Updated] (SPARK-40350) Improve the configuration of volcano scheduler
[ https://issues.apache.org/jira/browse/SPARK-40350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sun BiaoBiao updated SPARK-40350:
---------------------------------
    Description:
Now we use volcano as our scheduler, we need to specify the following configuration options:
{code:java}
spark.kubernetes.scheduler.name=volcano
spark.kubernetes.scheduler.volcano.podGroupTemplateFile=/path/to/podgroup-template.yaml
spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep
spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep
{code}
we should use configMap to mount the /path/to/podgroup-template.yaml file.
If we use spark config to specify the parameters of the podgroup, it will be much more convenient, we don't need configmap to mount static files.
In our scenario, we need to dynamically specify the volcano queue, but it is not convenient to create a static podgroup configuration file to mount.

    was: the same text without the final sentence about dynamically specifying the volcano queue
[jira] [Updated] (SPARK-40350) Use spark config to configure the parameters of volcano podgroup
[ https://issues.apache.org/jira/browse/SPARK-40350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sun BiaoBiao updated SPARK-40350:
---------------------------------
    Summary: Use spark config to configure the parameters of volcano podgroup  (was: Improve the configuration of volcano scheduler)
[jira] [Commented] (SPARK-40350) Use spark config to configure the parameters of volcano podgroup
[ https://issues.apache.org/jira/browse/SPARK-40350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600306#comment-17600306 ]

Apache Spark commented on SPARK-40350:
--------------------------------------
User 'zheniantoushipashi' has created a pull request for this issue:
https://github.com/apache/spark/pull/37802
[jira] [Assigned] (SPARK-40350) Use spark config to configure the parameters of volcano podgroup
[ https://issues.apache.org/jira/browse/SPARK-40350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40350:
------------------------------------
    Assignee: Apache Spark
[jira] [Assigned] (SPARK-40350) Use spark config to configure the parameters of volcano podgroup
[ https://issues.apache.org/jira/browse/SPARK-40350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40350:
------------------------------------
    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-39546) Respect port definitions on K8S pod templates for both driver and executor
[ https://issues.apache.org/jira/browse/SPARK-39546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600324#comment-17600324 ]

Apache Spark commented on SPARK-39546:
--------------------------------------
User 'fanyilun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37803

> Respect port definitions on K8S pod templates for both driver and executor
> ---------------------------------------------------------------------------
>
> Key: SPARK-39546
> URL: https://issues.apache.org/jira/browse/SPARK-39546
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes
> Affects Versions: 3.3.0
> Reporter: Oliver Koeth
> Priority: Minor
>
> *Description:*
> Spark on K8S allows opening additional ports for custom purposes on the driver pod via the pod template, but it ignores the port specification in the executor pod template. Port specifications from the pod template should be preserved (and extended) for both drivers and executors.
> *Scenario:*
> I want to run functionality in the executor that exposes data on an additional port. In my case, this is monitoring data exposed by Spark's JMX metrics sink via the JMX prometheus exporter java agent https://github.com/prometheus/jmx_exporter -- the java agent opens an extra port inside the container, but for prometheus to detect and scrape the port, it must be exposed in the K8S pod resource.
> (More background if desired: this seems to be the "classic" Spark 2 way to expose prometheus metrics. Spark 3 introduced a native equivalent servlet for the driver, but for the executor, only a rather limited set of metrics is forwarded via the driver, and that also follows a completely different naming scheme. So the JMX + exporter approach still turns out to be more useful for me, even in Spark 3.)
> Expected behavior:
> I add the following to my pod template to expose the extra port opened by the JMX exporter java agent:
> spec:
>   containers:
>     - ...
>       ports:
>         - containerPort: 8090
>           name: jmx-prometheus
>           protocol: TCP
> Observed behavior:
> The port is exposed for driver pods but not for executor pods.
> *Corresponding code:*
> Driver pod creation just adds ports
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala]
> (currently line 115)
> val driverContainer = new ContainerBuilder(pod.container)
>   ...
>   .addNewPort()
>   ...
>   .addNewPort()
> while executor pod creation replaces the ports
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala]
> (currently line 211)
> val executorContainer = new ContainerBuilder(pod.container)
>   ...
>   .withPorts(requiredPorts.asJava)
> The current handling is inconsistent and unnecessarily limiting. It seems that the executor creation could/should just as well preserve ports from the template and add the extra required ports.
> *Workaround:*
> It is possible to work around this limitation by adding a full sidecar container to the executor pod spec which declares the port. Sidecar containers are left unchanged by pod template handling.
> As all containers in a pod share the same network, it does not matter which container actually declares to expose the port.
[jira] [Assigned] (SPARK-39546) Respect port definitions on K8S pod templates for both driver and executor
[ https://issues.apache.org/jira/browse/SPARK-39546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39546:
------------------------------------
    Assignee: Apache Spark
[jira] [Assigned] (SPARK-39546) Respect port definitions on K8S pod templates for both driver and executor
[ https://issues.apache.org/jira/browse/SPARK-39546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39546:
------------------------------------
    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-39546) Respect port definitions on K8S pod templates for both driver and executor
[ https://issues.apache.org/jira/browse/SPARK-39546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600346#comment-17600346 ]

Yilun Fan commented on SPARK-39546:
-----------------------------------
I made a PR which I think can resolve this issue.
[jira] [Assigned] (SPARK-40335) Implement `DataFrameGroupBy.corr`.
[ https://issues.apache.org/jira/browse/SPARK-40335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng reassigned SPARK-40335:
-------------------------------------
    Assignee: Ruifeng Zheng

> Implement `DataFrameGroupBy.corr`.
> ----------------------------------
>
> Key: SPARK-40335
> URL: https://issues.apache.org/jira/browse/SPARK-40335
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Assignee: Ruifeng Zheng
> Priority: Major
>
> We should implement `DataFrameGroupBy.corr` to increase pandas API coverage.
> pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.corr.html
[jira] [Assigned] (SPARK-40336) Implement `DataFrameGroupBy.cov`.
[ https://issues.apache.org/jira/browse/SPARK-40336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng reassigned SPARK-40336:
-------------------------------------
    Assignee: Ruifeng Zheng

> Implement `DataFrameGroupBy.cov`.
> ---------------------------------
>
> Key: SPARK-40336
> URL: https://issues.apache.org/jira/browse/SPARK-40336
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Assignee: Ruifeng Zheng
> Priority: Major
>
> We should implement `DataFrameGroupBy.cov` to increase pandas API coverage.
> pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.cov.html
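For context, a minimal pure-pandas sketch of the corr/cov semantics these two sub-tasks target (illustration only, not the pandas-on-Spark implementation):

{code:python}
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "a", "b", "b", "b"],
    "x": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
    "y": [2.0, 4.0, 6.0, 5.0, 6.0, 7.0],
})

# Pairwise correlation / covariance of the numeric columns within each group.
print(df.groupby("key").corr())
print(df.groupby("key").cov())
{code}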
[jira] [Commented] (SPARK-40322) Fix all dead links
[ https://issues.apache.org/jira/browse/SPARK-40322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600367#comment-17600367 ]

Yang Jie commented on SPARK-40322:
----------------------------------
The links related to `Spark Summit` have now been redirected to https://www.databricks.com/dataaisummit/. Is it better to keep the links, or to remove the links and only keep the text?

> Fix all dead links
> ------------------
>
> Key: SPARK-40322
> URL: https://issues.apache.org/jira/browse/SPARK-40322
> Project: Spark
> Issue Type: Bug
> Components: Documentation
> Affects Versions: 3.4.0
> Reporter: Yuming Wang
> Priority: Major
>
> https://www.deadlinkchecker.com/website-dead-link-checker.asp
>
> ||Status||URL||Source link text||
> |-1 Not found: The server name or address could not be resolved|[http://engineering.ooyala.com/blog/using-parquet-and-scrooge-spark]|[Using Parquet and Scrooge with Spark|https://spark.apache.org/documentation.html]|
> |-1 Not found: The server name or address could not be resolved|[http://blinkdb.org/]|[BlinkDB|https://spark.apache.org/third-party-projects.html]|
> |404 Not Found|[https://github.com/AyasdiOpenSource/df]|[DF|https://spark.apache.org/third-party-projects.html]|
> |-1 Timeout|[https://atp.io/]|[atp|https://spark.apache.org/powered-by.html]|
> |-1 Not found: The server name or address could not be resolved|[http://www.sehir.edu.tr/en/]|[Istanbul Sehir University|https://spark.apache.org/powered-by.html]|
> |404 Not Found|[http://nsn.com/]|[Nokia Solutions and Networks|https://spark.apache.org/powered-by.html]|
> |-1 Not found: The server name or address could not be resolved|[http://www.nubetech.co/]|[Nube Technologies|https://spark.apache.org/powered-by.html]|
> |-1 Timeout|[http://ooyala.com/]|[Ooyala, Inc.|https://spark.apache.org/powered-by.html]|
> |-1 Not found: The server name or address could not be resolved|[http://engineering.ooyala.com/blog/fast-spark-queries-memory-datasets]|[Spark for Fast Queries|https://spark.apache.org/powered-by.html]|
> |-1 Not found: The server name or address could not be resolved|[http://www.sisa.samsung.com/]|[Samsung Research America|https://spark.apache.org/powered-by.html]|
> |-1 Timeout|[https://checker.apache.org/projs/spark.html]|[https://checker.apache.org/projs/spark.html|https://spark.apache.org/release-process.html]|
> |404 Not Found|[https://ampcamp.berkeley.edu/amp-camp-two-strata-2013/]|[AMP Camp 2 [302 from http://ampcamp.berkeley.edu/amp-camp-two-strata-2013/]|https://spark.apache.org/documentation.html]|
> |404 Not Found|[https://ampcamp.berkeley.edu/agenda-2012/]|[AMP Camp 1 [302 from http://ampcamp.berkeley.edu/agenda-2012/]|https://spark.apache.org/documentation.html]|
> |404 Not Found|[https://ampcamp.berkeley.edu/4/]|[AMP Camp 4 [302 from http://ampcamp.berkeley.edu/4/]|https://spark.apache.org/documentation.html]|
> |404 Not Found|[https://ampcamp.berkeley.edu/3/]|[AMP Camp 3 [302 from http://ampcamp.berkeley.edu/3/]|https://spark.apache.org/documentation.html]|
> |500 Internal Server Error|[https://www.packtpub.com/product/spark-cookbook/9781783987061]|[Spark Cookbook [301 from https://www.packtpub.com/big-data-and-business-intelligence/spark-cookbook]|https://spark.apache.org/documentation.html]|
> |500 Internal Server Error|[https://www.packtpub.com/product/apache-spark-graph-processing/9781784391805]|[Apache Spark Graph Processing [301 from https://www.packtpub.com/big-data-and-business-intelligence/apache-spark-graph-processing]|https://spark.apache.org/documentation.html]|
> |500 Internal Server Error|[https://prevalentdesignevents.com/sparksummit/eu17/]|[register|https://spark.apache.org/news/]|
> |500 Internal Server Error|[https://prevalentdesignevents.com/sparksummit/ss17/?_ga=1.211902866.780052874.1433437196]|[register|https://spark.apache.org/news/]|
> |500 Internal Server Error|[https://www.prevalentdesignevents.com/sparksummit2015/europe/registration.aspx?source=header]|[register|https://spark.apache.org/news/]|
> |500 Internal Server Error|[https://www.prevalentdesignevents.com/sparksummit2015/europe/speaker/]|[Spark Summit Europe|https://spark.apache.org/news/]|
> |-1 Timeout|[http://strataconf.com/strata2013]|[Strata|https://spark.apache.org/news/]|
> |-1 Not found: The server name or address could not be resolved|[http://blog.quantifind.com/posts/spark-unit-test/]|[Unit testing with Spark|https://spark.apache.org/news/]|
> |-1 Not found: The server name or address could not be resolved|[http://blog.quantifind.com/posts/logging-post/]|[Configuring Spark's logs|https://spark.apache.org/news/]|
> |-1 Timeout|[http://strata.oreilly.com/2012/08/seven-reasons-why-i-like-spark.html]|[Spark|https://spark.apache.org/news/]|
> |-1 Timeout|[http://strata.oreilly.com/2012/11/shark-real-time-queries-and-analytics-for-big-data.html]|[Shark|https://spark.apache.org/news/]|
[jira] [Created] (SPARK-40351) Spark Sum increases the precision of DecimalType arguments by 10
Tymofii created SPARK-40351:
----------------------------
    Summary: Spark Sum increases the precision of DecimalType arguments by 10
    Key: SPARK-40351
    URL: https://issues.apache.org/jira/browse/SPARK-40351
    Project: Spark
    Issue Type: Question
    Components: Optimizer
    Affects Versions: 3.2.0
    Reporter: Tymofii

Currently, Spark automatically increases the precision of a Decimal field by 10 (a hard-coded value) for the result of the SUM aggregate operation -- see https://github.com/apache/spark/blob/branch-3.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1877.

There are a couple of questions:
# Why was 10 chosen as the default?
# Does it make sense to allow the user to override this value via configuration?
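A minimal PySpark reproduction of the behavior being asked about: summing a DecimalType(10, 2) column yields decimal(20, 2), the input precision widened by 10.

{code:python}
from decimal import Decimal

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField("amount", DecimalType(10, 2))])
df = spark.createDataFrame([(Decimal("1.23"),), (Decimal("4.56"),)], schema)

# The precision is widened by the hard-coded 10 to reduce the risk of
# overflow when aggregating many rows.
df.select(F.sum("amount")).printSchema()
# root
#  |-- sum(amount): decimal(20,2) (nullable = true)
{code}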
[jira] [Commented] (SPARK-40322) Fix all dead links
[ https://issues.apache.org/jira/browse/SPARK-40322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600370#comment-17600370 ]

Yang Jie commented on SPARK-40322:
----------------------------------
Many historical links on the news page are no longer accessible.
[jira] [Commented] (SPARK-40322) Fix all dead links
[ https://issues.apache.org/jira/browse/SPARK-40322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600384#comment-17600384 ] Yang Jie commented on SPARK-40322: -- [https://www.packtpub.com/big-data-and-business-intelligence/spark-cookbook] and [https://www.packtpub.com/big-data-and-business-intelligence/apache-spark-graph-processing] not dead links > Fix all dead links > -- > > Key: SPARK-40322 > URL: https://issues.apache.org/jira/browse/SPARK-40322 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > > https://www.deadlinkchecker.com/website-dead-link-checker.asp > > > ||Status||URL||Source link text|| > |-1 Not found: The server name or address could not be > resolved|[http://engineering.ooyala.com/blog/using-parquet-and-scrooge-spark]|[Using > Parquet and Scrooge with Spark|https://spark.apache.org/documentation.html]| > |-1 Not found: The server name or address could not be > resolved|[http://blinkdb.org/]|[BlinkDB|https://spark.apache.org/third-party-projects.html]| > |404 Not > Found|[https://github.com/AyasdiOpenSource/df]|[DF|https://spark.apache.org/third-party-projects.html]| > |-1 Timeout|[https://atp.io/]|[atp|https://spark.apache.org/powered-by.html]| > |-1 Not found: The server name or address could not be > resolved|[http://www.sehir.edu.tr/en/]|[Istanbul Sehir > University|https://spark.apache.org/powered-by.html]| > |404 Not Found|[http://nsn.com/]|[Nokia Solutions and > Networks|https://spark.apache.org/powered-by.html]| > |-1 Not found: The server name or address could not be > resolved|[http://www.nubetech.co/]|[Nube > Technologies|https://spark.apache.org/powered-by.html]| > |-1 Timeout|[http://ooyala.com/]|[Ooyala, > Inc.|https://spark.apache.org/powered-by.html]| > |-1 Not found: The server name or address could not be > resolved|[http://engineering.ooyala.com/blog/fast-spark-queries-memory-datasets]|[Spark > for Fast Queries|https://spark.apache.org/powered-by.html]| > |-1 Not found: The server name or address could not be > resolved|[http://www.sisa.samsung.com/]|[Samsung Research > America|https://spark.apache.org/powered-by.html]| > |-1 > Timeout|[https://checker.apache.org/projs/spark.html]|[https://checker.apache.org/projs/spark.html|https://spark.apache.org/release-process.html]| > |404 Not Found|[https://ampcamp.berkeley.edu/amp-camp-two-strata-2013/]|[AMP > Camp 2 [302 from > http://ampcamp.berkeley.edu/amp-camp-two-strata-2013/]|https://spark.apache.org/documentation.html]| > |404 Not Found|[https://ampcamp.berkeley.edu/agenda-2012/]|[AMP Camp 1 [302 > from > http://ampcamp.berkeley.edu/agenda-2012/]|https://spark.apache.org/documentation.html]| > |404 Not Found|[https://ampcamp.berkeley.edu/4/]|[AMP Camp 4 [302 from > http://ampcamp.berkeley.edu/4/]|https://spark.apache.org/documentation.html]| > |404 Not Found|[https://ampcamp.berkeley.edu/3/]|[AMP Camp 3 [302 from > http://ampcamp.berkeley.edu/3/]|https://spark.apache.org/documentation.html]| > |500 Internal Server > Error|[https://www.packtpub.com/product/spark-cookbook/9781783987061]|[Spark > Cookbook [301 from > https://www.packtpub.com/big-data-and-business-intelligence/spark-cookbook]|https://spark.apache.org/documentation.html]| > |500 Internal Server > Error|[https://www.packtpub.com/product/apache-spark-graph-processing/9781784391805]|[Apache > Spark Graph Processing [301 from > 
https://www.packtpub.com/big-data-and-business-intelligence/apache-spark-graph-processing]|https://spark.apache.org/documentation.html]| > |500 Internal Server > Error|[https://prevalentdesignevents.com/sparksummit/eu17/]|[register|https://spark.apache.org/news/]| > |500 Internal Server > Error|[https://prevalentdesignevents.com/sparksummit/ss17/?_ga=1.211902866.780052874.1433437196]|[register|https://spark.apache.org/news/]| > |500 Internal Server > Error|[https://www.prevalentdesignevents.com/sparksummit2015/europe/registration.aspx?source=header]|[register|https://spark.apache.org/news/]| > |500 Internal Server > Error|[https://www.prevalentdesignevents.com/sparksummit2015/europe/speaker/]|[Spark > Summit Europe|https://spark.apache.org/news/]| > |-1 > Timeout|[http://strataconf.com/strata2013]|[Strata|https://spark.apache.org/news/]| > |-1 Not found: The server name or address could not be > resolved|[http://blog.quantifind.com/posts/spark-unit-test/]|[Unit testing > with Spark|https://spark.apache.org/news/]| > |-1 Not found: The server name or address could not be > resolved|[http://blog.quantifind.com/posts/logging-post/]|[Configuring > Spark's logs|https://spark.apache.org/news/]| > |-1 > Timeout|[http://strata.oreilly.com/2012/08/seven-reasons-why-i-like-spark.html]|[Spark|https://spark.apache.org/news/]| > |-1 > T
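For anyone re-verifying rows from the table above, a rough stand-in for the checker is sketched below. The ticket used deadlinkchecker.com, whose "-1" statuses correspond to DNS failures and timeouts rather than HTTP codes; the URLs and semantics here are only an approximation:

{code:python}
import requests

# A few of the URLs from the table above.
URLS = [
    "http://blinkdb.org/",
    "https://github.com/AyasdiOpenSource/df",
    "https://www.packtpub.com/big-data-and-business-intelligence/spark-cookbook",
]

for url in URLS:
    try:
        resp = requests.head(url, timeout=10, allow_redirects=True)
        print(resp.status_code, url)
    except requests.RequestException as exc:
        # Reported by the checker as "-1 Timeout" or "-1 Not found".
        print(f"-1 ({type(exc).__name__})", url)
{code}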
[jira] [Updated] (SPARK-40322) Fix all dead links
[ https://issues.apache.org/jira/browse/SPARK-40322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-40322: - Description: [https://www.deadlinkchecker.com/website-dead-link-checker.asp] ||Status||URL||Source link text|| |-1 Not found: The server name or address could not be resolved|[http://engineering.ooyala.com/blog/using-parquet-and-scrooge-spark]|[Using Parquet and Scrooge with Spark|https://spark.apache.org/documentation.html]| |-1 Not found: The server name or address could not be resolved|[http://blinkdb.org/]|[BlinkDB|https://spark.apache.org/third-party-projects.html]| |404 Not Found|[https://github.com/AyasdiOpenSource/df]|[DF|https://spark.apache.org/third-party-projects.html]| |-1 Timeout|[https://atp.io/]|[atp|https://spark.apache.org/powered-by.html]| |-1 Not found: The server name or address could not be resolved|[http://www.sehir.edu.tr/en/]|[Istanbul Sehir University|https://spark.apache.org/powered-by.html]| |404 Not Found|[http://nsn.com/]|[Nokia Solutions and Networks|https://spark.apache.org/powered-by.html]| |-1 Not found: The server name or address could not be resolved|[http://www.nubetech.co/]|[Nube Technologies|https://spark.apache.org/powered-by.html]| |-1 Timeout|[http://ooyala.com/]|[Ooyala, Inc.|https://spark.apache.org/powered-by.html]| |-1 Not found: The server name or address could not be resolved|[http://engineering.ooyala.com/blog/fast-spark-queries-memory-datasets]|[Spark for Fast Queries|https://spark.apache.org/powered-by.html]| |-1 Not found: The server name or address could not be resolved|[http://www.sisa.samsung.com/]|[Samsung Research America|https://spark.apache.org/powered-by.html]| |-1 Timeout|[https://checker.apache.org/projs/spark.html]|[https://checker.apache.org/projs/spark.html|https://spark.apache.org/release-process.html]| |404 Not Found|[https://ampcamp.berkeley.edu/amp-camp-two-strata-2013/]|[AMP Camp 2 [302 from http://ampcamp.berkeley.edu/amp-camp-two-strata-2013/]|https://spark.apache.org/documentation.html]| |404 Not Found|[https://ampcamp.berkeley.edu/agenda-2012/]|[AMP Camp 1 [302 from http://ampcamp.berkeley.edu/agenda-2012/]|https://spark.apache.org/documentation.html]| |404 Not Found|[https://ampcamp.berkeley.edu/4/]|[AMP Camp 4 [302 from http://ampcamp.berkeley.edu/4/]|https://spark.apache.org/documentation.html]| |404 Not Found|[https://ampcamp.berkeley.edu/3/]|[AMP Camp 3 [302 from http://ampcamp.berkeley.edu/3/]|https://spark.apache.org/documentation.html]| |-500 Internal Server Error-|-[https://www.packtpub.com/product/spark-cookbook/9781783987061]-|-[Spark Cookbook [301 from https://www.packtpub.com/big-data-and-business-intelligence/spark-cookbook]|https://spark.apache.org/documentation.html]-| |-500 Internal Server Error-|-[https://www.packtpub.com/product/apache-spark-graph-processing/9781784391805]-|-[Apache Spark Graph Processing [301 from https://www.packtpub.com/big-data-and-business-intelligence/apache-spark-graph-processing]|https://spark.apache.org/documentation.html]-| |500 Internal Server Error|[https://prevalentdesignevents.com/sparksummit/eu17/]|[register|https://spark.apache.org/news/]| |500 Internal Server Error|[https://prevalentdesignevents.com/sparksummit/ss17/?_ga=1.211902866.780052874.1433437196]|[register|https://spark.apache.org/news/]| |500 Internal Server Error|[https://www.prevalentdesignevents.com/sparksummit2015/europe/registration.aspx?source=header]|[register|https://spark.apache.org/news/]| |500 Internal Server 
Error|[https://www.prevalentdesignevents.com/sparksummit2015/europe/speaker/]|[Spark Summit Europe|https://spark.apache.org/news/]| |-1 Timeout|[http://strataconf.com/strata2013]|[Strata|https://spark.apache.org/news/]| |-1 Not found: The server name or address could not be resolved|[http://blog.quantifind.com/posts/spark-unit-test/]|[Unit testing with Spark|https://spark.apache.org/news/]| |-1 Not found: The server name or address could not be resolved|[http://blog.quantifind.com/posts/logging-post/]|[Configuring Spark's logs|https://spark.apache.org/news/]| |-1 Timeout|[http://strata.oreilly.com/2012/08/seven-reasons-why-i-like-spark.html]|[Spark|https://spark.apache.org/news/]| |-1 Timeout|[http://strata.oreilly.com/2012/11/shark-real-time-queries-and-analytics-for-big-data.html]|[Shark|https://spark.apache.org/news/]| |-1 Timeout|[http://strata.oreilly.com/2012/10/spark-0-6-improves-performance-and-accessibility.html]|[Spark 0.6 release|https://spark.apache.org/news/]| |404 Not Found|[http://data-informed.com/spark-an-open-source-engine-for-iterative-data-mining/]|[DataInformed|https://spark.apache.org/news/]| |-1 Timeout|[http://strataconf.com/strata2013/public/schedule/detail/27438]|[introduction to Spark, Shark and BDAS|https://spark.apache.org/news/]| |-1 Timeout|[http://strataconf.com/strata2013/public/schedule/detail/27440]|[hands-on exercise session|h
[jira] [Comment Edited] (SPARK-40322) Fix all dead links
[ https://issues.apache.org/jira/browse/SPARK-40322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600384#comment-17600384 ] Yang Jie edited comment on SPARK-40322 at 9/5/22 12:51 PM: --- [https://www.packtpub.com/big-data-and-business-intelligence/spark-cookbook] and [https://www.packtpub.com/big-data-and-business-intelligence/apache-spark-graph-processing] [https://www.packtpub.com/big-data-and-business-intelligence/big-data-analytics] not dead links was (Author: luciferyang): [https://www.packtpub.com/big-data-and-business-intelligence/spark-cookbook] and [https://www.packtpub.com/big-data-and-business-intelligence/apache-spark-graph-processing] not dead links > Fix all dead links > -- > > Key: SPARK-40322 > URL: https://issues.apache.org/jira/browse/SPARK-40322 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > > [https://www.deadlinkchecker.com/website-dead-link-checker.asp] > > > ||Status||URL||Source link text|| > |-1 Not found: The server name or address could not be > resolved|[http://engineering.ooyala.com/blog/using-parquet-and-scrooge-spark]|[Using > Parquet and Scrooge with Spark|https://spark.apache.org/documentation.html]| > |-1 Not found: The server name or address could not be > resolved|[http://blinkdb.org/]|[BlinkDB|https://spark.apache.org/third-party-projects.html]| > |404 Not > Found|[https://github.com/AyasdiOpenSource/df]|[DF|https://spark.apache.org/third-party-projects.html]| > |-1 Timeout|[https://atp.io/]|[atp|https://spark.apache.org/powered-by.html]| > |-1 Not found: The server name or address could not be > resolved|[http://www.sehir.edu.tr/en/]|[Istanbul Sehir > University|https://spark.apache.org/powered-by.html]| > |404 Not Found|[http://nsn.com/]|[Nokia Solutions and > Networks|https://spark.apache.org/powered-by.html]| > |-1 Not found: The server name or address could not be > resolved|[http://www.nubetech.co/]|[Nube > Technologies|https://spark.apache.org/powered-by.html]| > |-1 Timeout|[http://ooyala.com/]|[Ooyala, > Inc.|https://spark.apache.org/powered-by.html]| > |-1 Not found: The server name or address could not be > resolved|[http://engineering.ooyala.com/blog/fast-spark-queries-memory-datasets]|[Spark > for Fast Queries|https://spark.apache.org/powered-by.html]| > |-1 Not found: The server name or address could not be > resolved|[http://www.sisa.samsung.com/]|[Samsung Research > America|https://spark.apache.org/powered-by.html]| > |-1 > Timeout|[https://checker.apache.org/projs/spark.html]|[https://checker.apache.org/projs/spark.html|https://spark.apache.org/release-process.html]| > |404 Not Found|[https://ampcamp.berkeley.edu/amp-camp-two-strata-2013/]|[AMP > Camp 2 [302 from > http://ampcamp.berkeley.edu/amp-camp-two-strata-2013/]|https://spark.apache.org/documentation.html]| > |404 Not Found|[https://ampcamp.berkeley.edu/agenda-2012/]|[AMP Camp 1 [302 > from > http://ampcamp.berkeley.edu/agenda-2012/]|https://spark.apache.org/documentation.html]| > |404 Not Found|[https://ampcamp.berkeley.edu/4/]|[AMP Camp 4 [302 from > http://ampcamp.berkeley.edu/4/]|https://spark.apache.org/documentation.html]| > |404 Not Found|[https://ampcamp.berkeley.edu/3/]|[AMP Camp 3 [302 from > http://ampcamp.berkeley.edu/3/]|https://spark.apache.org/documentation.html]| > |-500 Internal Server > Error-|-[https://www.packtpub.com/product/spark-cookbook/9781783987061]-|-[Spark > Cookbook [301 from > 
https://www.packtpub.com/big-data-and-business-intelligence/spark-cookbook]|https://spark.apache.org/documentation.html]-| > |-500 Internal Server > Error-|-[https://www.packtpub.com/product/apache-spark-graph-processing/9781784391805]-|-[Apache > Spark Graph Processing [301 from > https://www.packtpub.com/big-data-and-business-intelligence/apache-spark-graph-processing]|https://spark.apache.org/documentation.html]-| > |500 Internal Server > Error|[https://prevalentdesignevents.com/sparksummit/eu17/]|[register|https://spark.apache.org/news/]| > |500 Internal Server > Error|[https://prevalentdesignevents.com/sparksummit/ss17/?_ga=1.211902866.780052874.1433437196]|[register|https://spark.apache.org/news/]| > |500 Internal Server > Error|[https://www.prevalentdesignevents.com/sparksummit2015/europe/registration.aspx?source=header]|[register|https://spark.apache.org/news/]| > |500 Internal Server > Error|[https://www.prevalentdesignevents.com/sparksummit2015/europe/speaker/]|[Spark > Summit Europe|https://spark.apache.org/news/]| > |-1 > Timeout|[http://strataconf.com/strata2013]|[Strata|https://spark.apache.org/news/]| > |-1 Not found: The server name or address could not be > resolved|[http://blog.quantifind.com/posts/spark-unit-test/]|[Unit testing
[jira] [Assigned] (SPARK-40352) Add function aliases: len, datepart, dateadd, date_diff and curdate
[ https://issues.apache.org/jira/browse/SPARK-40352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40352: Assignee: Max Gekk (was: Apache Spark) > Add function aliases: len, datepart, dateadd, date_diff and curdate > --- > > Key: SPARK-40352 > URL: https://issues.apache.org/jira/browse/SPARK-40352 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The functions len, datepart, dateadd, date_diff and curdate exist in other > systems, and Spark SQL has similar functions. So, adding such aliases will > make the migration to Spark SQL easier. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40352) Add function aliases: len, datepart, dateadd, date_diff and curdate
[ https://issues.apache.org/jira/browse/SPARK-40352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40352: Assignee: Apache Spark (was: Max Gekk) > Add function aliases: len, datepart, dateadd, date_diff and curdate > --- > > Key: SPARK-40352 > URL: https://issues.apache.org/jira/browse/SPARK-40352 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > The functions len, datepart, dateadd, date_diff and curdate exist in other > systems, and Spark SQL has similar functions. So, adding such aliases will > make the migration to Spark SQL easier. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40352) Add function aliases: len, datepart, dateadd, date_diff and curdate
Max Gekk created SPARK-40352: Summary: Add function aliases: len, datepart, dateadd, date_diff and curdate Key: SPARK-40352 URL: https://issues.apache.org/jira/browse/SPARK-40352 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Assignee: Max Gekk The functions len, datepart, dateadd, date_diff and curdate exist in other systems, and Spark SQL has similar functions. So, adding such aliases will make the migration to Spark SQL easier. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
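To make the intent concrete, here is a minimal sketch using the functions Spark SQL already ships; the new names are assumed to be pure aliases of these (exact signatures unverified here):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Existing spellings the aliases are assumed to map onto:
spark.sql("SELECT length('Spark')").show()   # len('Spark')
spark.sql("SELECT current_date()").show()    # curdate()
spark.sql("SELECT datediff(DATE'2022-09-05', DATE'2022-09-01')").show()  # date_diff(...)
{code}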
[jira] [Commented] (SPARK-40352) Add function aliases: len, datepart, dateadd, date_diff and curdate
[ https://issues.apache.org/jira/browse/SPARK-40352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600458#comment-17600458 ] Apache Spark commented on SPARK-40352: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37804 > Add function aliases: len, datepart, dateadd, date_diff and curdate > --- > > Key: SPARK-40352 > URL: https://issues.apache.org/jira/browse/SPARK-40352 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The functions len, datepart, dateadd, date_diff and curdate exist in other > systems, and Spark SQL has similar functions. So, adding such aliases will > make the migration to Spark SQL easier. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.
[ https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600496#comment-17600496 ] Kyle Kent commented on SPARK-38004: --- [~itholic] I can create a PR for this. Should the change fit here in this function? https://github.com/apache/spark/blob/f9409ce7d49c25718317298031c84d1c8d6317af/python/pyspark/pandas/namespace.py#:~:text=internally.-,mangle_dupe_cols%20%3A%20bool%2C%20default%20True,are%20duplicate%20names%20in%20the%20columns.,-**kwds%20%3A%20optional I'm thinking of adding it as a note after the mangle_dup_col parameter. > read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns > but fails if the duplicate columns are case sensitive. > > > Key: SPARK-38004 > URL: https://issues.apache.org/jira/browse/SPARK-38004 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Saikrishna Pujari >Priority: Minor > > mangle_dupe_cols - default is True > So ideally it should have handled duplicate columns, but in case the columns > are case sensitive it fails as below. > AnalysisException: Reference '{{{}Sheet.col{}}}' is ambiguous, could be > Sheet.col, Sheet.col. > Where two columns are Col and cOL > In the best practices, there is a mention of not to use case sensitive > columns - > [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names] > Either the docs for read_excel/mangle_dupe_cols have to be updated about this > or it has to be handled. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
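For context, a minimal reproduction sketch of the reported behaviour (the file path and headers are made up; writing the file needs openpyxl):

{code:python}
import pandas as pd
import pyspark.pandas as ps

# Two headers that differ only by case; pandas keeps both as-is.
pd.DataFrame([[1, 2]], columns=["Col", "cOL"]).to_excel("/tmp/dup.xlsx", index=False)

# mangle_dupe_cols (default True) only renames exact duplicates, e.g.
# "col", "col.1". "Col" vs "cOL" collide only under Spark's case-insensitive
# analysis, so this is expected to fail with:
#   AnalysisException: Reference 'Sheet.col' is ambiguous ...
psdf = ps.read_excel("/tmp/dup.xlsx")
{code}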
[jira] [Created] (SPARK-40353) Re-enable the `read_excel` tests
Haejoon Lee created SPARK-40353: --- Summary: Re-enable the `read_excel` tests Key: SPARK-40353 URL: https://issues.apache.org/jira/browse/SPARK-40353 Project: Spark Issue Type: Bug Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee So far, we've been skipping the `read_excel` test in pandas API on Spark: https://github.com/apache/spark/blob/6d2ce128058b439094cd1dd54253372af6977e79/python/pyspark/pandas/tests/test_dataframe_spark_io.py#L251 In https://github.com/apache/spark/pull/37671, we installing `openpyxl==3.0.10` to re-enable the `read_excel` tests, but it's still failing for some reason (Please see https://github.com/apache/spark/pull/37671#issuecomment-1237515485 for more detail). We should re-enable this test for improving the pandas-on-Spark test coverage. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
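The skip the ticket refers to looks roughly like the pattern below (a sketch only; the real guard lives in the test_dataframe_spark_io.py file linked above):

{code:python}
import unittest

try:
    import openpyxl  # noqa: F401  (the engine pandas' read_excel uses for .xlsx)
    have_openpyxl = True
except ImportError:
    have_openpyxl = False

class DataFrameSparkIOTests(unittest.TestCase):
    @unittest.skipIf(not have_openpyxl, "openpyxl is required for read_excel")
    def test_read_excel(self):
        ...
{code}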
[jira] [Updated] (SPARK-40353) Re-enable the `read_excel` tests
[ https://issues.apache.org/jira/browse/SPARK-40353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-40353: Description: So far, we've been skipping the `read_excel` test in pandas API on Spark: https://github.com/apache/spark/blob/6d2ce128058b439094cd1dd54253372af6977e79/python/pyspark/pandas/tests/test_dataframe_spark_io.py#L251 In https://github.com/apache/spark/pull/37671, we installed `openpyxl==3.0.10` to re-enable the `read_excel` tests, but it's still failing for some reason (Please see https://github.com/apache/spark/pull/37671#issuecomment-1237515485 for more detail). We should re-enable this test for improving the pandas-on-Spark test coverage. was: So far, we've been skipping the `read_excel` test in pandas API on Spark: https://github.com/apache/spark/blob/6d2ce128058b439094cd1dd54253372af6977e79/python/pyspark/pandas/tests/test_dataframe_spark_io.py#L251 In https://github.com/apache/spark/pull/37671, we installing `openpyxl==3.0.10` to re-enable the `read_excel` tests, but it's still failing for some reason (Please see https://github.com/apache/spark/pull/37671#issuecomment-1237515485 for more detail). We should re-enable this test for improving the pandas-on-Spark test coverage. > Re-enable the `read_excel` tests > > > Key: SPARK-40353 > URL: https://issues.apache.org/jira/browse/SPARK-40353 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > So far, we've been skipping the `read_excel` test in pandas API on Spark: > https://github.com/apache/spark/blob/6d2ce128058b439094cd1dd54253372af6977e79/python/pyspark/pandas/tests/test_dataframe_spark_io.py#L251 > In https://github.com/apache/spark/pull/37671, we installed > `openpyxl==3.0.10` to re-enable the `read_excel` tests, but it's still > failing for some reason (Please see > https://github.com/apache/spark/pull/37671#issuecomment-1237515485 for more > detail). > We should re-enable this test for improving the pandas-on-Spark test coverage. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.
[ https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600528#comment-17600528 ] Haejoon Lee commented on SPARK-38004: - [~kentkr] Yes, I think adding a note for the parameter looks good enough for now. Please go ahead to create a PR and ping me. I'm willing to review for this :) > read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns > but fails if the duplicate columns are case sensitive. > > > Key: SPARK-38004 > URL: https://issues.apache.org/jira/browse/SPARK-38004 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Saikrishna Pujari >Priority: Minor > > mangle_dupe_cols - default is True > So ideally it should have handled duplicate columns, but in case the columns > are case sensitive it fails as below. > AnalysisException: Reference '{{{}Sheet.col{}}}' is ambiguous, could be > Sheet.col, Sheet.col. > Where two columns are Col and cOL > In the best practices, there is a mention of not to use case sensitive > columns - > [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names] > Either the docs for read_excel/mangle_dupe_cols have to be updated about this > or it has to be handled. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36999) Document the command ALTER TABLE RECOVER PARTITIONS
[ https://issues.apache.org/jira/browse/SPARK-36999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600530#comment-17600530 ] Rajanikant Vellaturi commented on SPARK-36999: -- Hi [~maxgekk] , Can I work on this? Please let me know. Thanks > Document the command ALTER TABLE RECOVER PARTITIONS > --- > > Key: SPARK-36999 > URL: https://issues.apache.org/jira/browse/SPARK-36999 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > Labels: starter > > Update the page > [https://spark.apache.org/docs/3.1.2/sql-ref-syntax-ddl-alter-table.html,] > and document the command ALTER TABLE RECOVER PARTITIONS -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
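For reference while writing the docs, the command in question looks like this (the table name is illustrative):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Scans the table location and adds any partition directories found on
# storage (e.g. dt=2022-09-05/) that are missing from the metastore.
spark.sql("ALTER TABLE my_partitioned_table RECOVER PARTITIONS")
# Equivalent spelling in Spark SQL: MSCK REPAIR TABLE my_partitioned_table
{code}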
[jira] [Commented] (SPARK-40351) Spark Sum increases the precision of DecimalType arguments by 10
[ https://issues.apache.org/jira/browse/SPARK-40351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600535#comment-17600535 ] Yuming Wang commented on SPARK-40351: - https://github.com/apache/spark/blob/v3.3.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Sum.scala#L52-L53 Why do you want to override this value? > Spark Sum increases the precision of DecimalType arguments by 10 > > > Key: SPARK-40351 > URL: https://issues.apache.org/jira/browse/SPARK-40351 > Project: Spark > Issue Type: Question > Components: Optimizer >Affects Versions: 3.2.0 >Reporter: Tymofii >Priority: Minor > > Currently in Spark automatically increases Decimal field by 10 (hard coded > value) after SUM aggregate operation - > [https://github.com/apache/spark/blob/branch-3.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1877.] > There are a couple of questions: > # Why was 10 chosen as default one? > # Is it make sense to allow the user to override this value via > configuration? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
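The behaviour being asked about is easy to observe in the aggregated schema (a minimal sketch):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(3).select(F.col("id").cast("decimal(10,2)").alias("amount"))
df.printSchema()                       # amount: decimal(10,2)

# Sum widens the result by 10 digits of precision as overflow headroom:
df.agg(F.sum("amount")).printSchema()  # sum(amount): decimal(20,2)
{code}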
[jira] [Assigned] (SPARK-40326) upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 to 2.13.4
[ https://issues.apache.org/jira/browse/SPARK-40326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-40326: Assignee: Bjørn Jørgensen > upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 > to 2.13.4 > -- > > Key: SPARK-40326 > URL: https://issues.apache.org/jira/browse/SPARK-40326 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > > [CVE-2022-25857|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-25857] > [SNYK-JAVA-ORGYAML|https://security.snyk.io/vuln/SNYK-JAVA-ORGYAML-2806360] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40326) upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 to 2.13.4
[ https://issues.apache.org/jira/browse/SPARK-40326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-40326: - Priority: Minor (was: Major) > upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 > to 2.13.4 > -- > > Key: SPARK-40326 > URL: https://issues.apache.org/jira/browse/SPARK-40326 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Minor > Fix For: 3.4.0, 3.3.1 > > > [CVE-2022-25857|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-25857] > [SNYK-JAVA-ORGYAML|https://security.snyk.io/vuln/SNYK-JAVA-ORGYAML-2806360] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40326) upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 to 2.13.4
[ https://issues.apache.org/jira/browse/SPARK-40326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-40326. -- Fix Version/s: 3.3.1 3.4.0 Resolution: Fixed Issue resolved by pull request 37796 [https://github.com/apache/spark/pull/37796] > upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 > to 2.13.4 > -- > > Key: SPARK-40326 > URL: https://issues.apache.org/jira/browse/SPARK-40326 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.3.1, 3.4.0 > > > [CVE-2022-25857|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-25857] > [SNYK-JAVA-ORGYAML|https://security.snyk.io/vuln/SNYK-JAVA-ORGYAML-2806360] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39671) insert overwrite table java.lang.NoSuchMethodException: org.apache.hadoop.hive.ql.metadata.Hive.loadPartition .This problem occurred when we installed Apache Spark3.0.
[ https://issues.apache.org/jira/browse/SPARK-39671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600541#comment-17600541 ] Iqbal Singh commented on SPARK-39671: - Is there a way to reproduce it, or is this something specific to Cloudera distribution only. > insert overwrite table java.lang.NoSuchMethodException: > org.apache.hadoop.hive.ql.metadata.Hive.loadPartition .This problem occurred > when we installed Apache Spark3.0.1-hadoop3.0 in CDH6.1.1 > > > Key: SPARK-39671 > URL: https://issues.apache.org/jira/browse/SPARK-39671 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: xin >Priority: Major > > use spark-thrifter run this sql insert overwrite table xx.xx > partition(dt=2022-06-30) select * from xxx.xxx; The SQL execution > environment is cdh 6.1.1 hive version 2.1.1 > > > raise OperationalError(response) pyhive.exc.OperationalError: > TExecuteStatementResp(status=TStatus(statusCode=3, > infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Error running > query: java.lang.NoSuchMethodException: > org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(org.apache.hadoop.fs.Path, > java.lang.String, java.util.Map, boolean, boolean, boolean, boolean, > boolean, boolean):25:24', > 'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute:SparkExecuteStatementOperation.scala:321', > > 'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:runInternal:SparkExecuteStatementOperation.scala:202', > 'org.apache.hive.service.cli.operation.Operation:run:Operation.java:278', > 'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:org$apache$spark$sql$hive$thriftserver$SparkOperation$$super$run:SparkExecuteStatementOperation.scala:46', > > 'org.apache.spark.sql.hive.thriftserver.SparkOperation:$anonfun$run$1:SparkOperation.scala:44', > 'scala.runtime.java8.JFunction0$mcV$sp:apply:JFunction0$mcV$sp.java:23', > 'org.apache.spark.sql.hive.thriftserver.SparkOperation:withLocalProperties:SparkOperation.scala:78', > > 'org.apache.spark.sql.hive.thriftserver.SparkOperation:withLocalProperties$:SparkOperation.scala:62', > > 'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:withLocalProperties:SparkExecuteStatementOperation.scala:46', > > 'org.apache.spark.sql.hive.thriftserver.SparkOperation:run:SparkOperation.scala:44', > > 'org.apache.spark.sql.hive.thriftserver.SparkOperation:run$:SparkOperation.scala:42', > > 'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:run:SparkExecuteStatementOperation.scala:46', > > 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:484', > > 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:460', > > 'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:280', > > 'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:439', > > 'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1437', > > 'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1422', > 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:38', > 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', > 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:53', > > 
'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:310', > > 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149', > > 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624', > 'java.lang.Thread:run:Thread.java:748', > '*java.lang.NoSuchMethodException:org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(org.apache.hadoop.fs.Path, > java.lang.String, java.util.Map, boolean, boolean, boolean, boolean, > boolean, boolean):63:38', 'java.lang.Class:getMethod:Class.java:1786', > 'org.apache.spark.sql.hive.client.Shim:findMethod:HiveShim.scala:177', > 'org.apache.spark.sql.hive.client.Shim_v2_1:loadPartitionMethod$lzycompute:HiveShim.scala:1151', > > 'org.apache.spark.sql.hive.client.Shim_v2_1:loadPartitionMethod:HiveShim.scala:1139', > > 'org.apache.spark.sql.hive.client.Shim_v2_1:loadPartition:HiveShim.scala:1201', > > 'org.apache.spark.sql.hive.client.HiveClientImpl:$anonfun$loadPartition$1:HiveClientImpl.scala:872', > 'scala.runtime.ja
[jira] [Resolved] (SPARK-40313) ps.DataFrame(data, index) should support the same anchor
[ https://issues.apache.org/jira/browse/SPARK-40313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40313. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37768 [https://github.com/apache/spark/pull/37768] > ps.DataFrame(data, index) should support the same anchor > > > Key: SPARK-40313 > URL: https://issues.apache.org/jira/browse/SPARK-40313 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40313) ps.DataFrame(data, index) should support the same anchor
[ https://issues.apache.org/jira/browse/SPARK-40313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40313: Assignee: Ruifeng Zheng > ps.DataFrame(data, index) should support the same anchor > > > Key: SPARK-40313 > URL: https://issues.apache.org/jira/browse/SPARK-40313 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39752) Spark job failed with 10M rows data with Broken pipe error
[ https://issues.apache.org/jira/browse/SPARK-39752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600543#comment-17600543 ] Iqbal Singh commented on SPARK-39752: - [~sshukla05] , Could you please provide the stack trace for the issue or a way to reproduce the error. > Spark job failed with 10M rows data with Broken pipe error > -- > > Key: SPARK-39752 > URL: https://issues.apache.org/jira/browse/SPARK-39752 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.3, 3.2.1 >Reporter: SHOBHIT SHUKLA >Priority: Major > Fix For: 3.0.2 > > > Spark job failed with 10M rows data with Broken pipe error. Same spark job > was working previously with the settings "executor_cores": 1, > "executor_memory": 1, "driver_cores": 1, "driver_memory": 1. where as the > same job is failing with spark settings in 3.0.3 and 3.2.1. > Major symptoms (slowness, timeout, out of memory as examples): Spark job is > failing with the error java.net.SocketException: Broken pipe (Write failed) > Here are the spark settings information which is working on Spark 3.0.3 and > 3.2.1 : "executor_cores": 4, "executor_memory": 4, "driver_cores": 4, > "driver_memory": 4.. The spark job doesn't consistently works with the above > settings. Some times, need to increase the cores and memory. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40351) Spark Sum increases the precision of DecimalType arguments by 10
[ https://issues.apache.org/jira/browse/SPARK-40351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tymofii updated SPARK-40351: Description: Currently in Spark automatically increases Decimal field by 10 (hard coded value) after SUM aggregate operation - [https://github.com/apache/spark/blob/branch-3.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1877.] There are a couple of questions: # Why was 10 chosen as default one? # Does it make sense to allow the user to override this value via configuration? was: Currently in Spark automatically increases Decimal field by 10 (hard coded value) after SUM aggregate operation - [https://github.com/apache/spark/blob/branch-3.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1877.] There are a couple of questions: # Why was 10 chosen as default one? # Is it make sense to allow the user to override this value via configuration? > Spark Sum increases the precision of DecimalType arguments by 10 > > > Key: SPARK-40351 > URL: https://issues.apache.org/jira/browse/SPARK-40351 > Project: Spark > Issue Type: Question > Components: Optimizer >Affects Versions: 3.2.0 >Reporter: Tymofii >Priority: Minor > > Currently in Spark automatically increases Decimal field by 10 (hard coded > value) after SUM aggregate operation - > [https://github.com/apache/spark/blob/branch-3.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1877.] > There are a couple of questions: > # Why was 10 chosen as default one? > # Does it make sense to allow the user to override this value via > configuration? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38404) Spark does not find CTE inside nested CTE
[ https://issues.apache.org/jira/browse/SPARK-38404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-38404: Fix Version/s: 3.3.1 > Spark does not find CTE inside nested CTE > - > > Key: SPARK-38404 > URL: https://issues.apache.org/jira/browse/SPARK-38404 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 > Environment: Tested on: > * MacOS Monterrey 12.2.1 (21D62) > * python 3.9.10 > * pip 22.0.3 > * pyspark 3.2.0 & 3.2.1 (SQL query does not work) and pyspark 3.0.1 and > 3.1.3 (SQL query works) >Reporter: Joan Heredia Rius >Assignee: Peter Toth >Priority: Minor > Fix For: 3.4.0, 3.3.1 > > > Hello! > Seems that when defining CTEs and using them inside another CTE in Spark SQL, > Spark thinks the inner call for the CTE is a table or view, which is not > found and then it errors with `Table or view not found: ` > h3. Steps to reproduce > # `pip install pyspark==3.2.0` (also happens with 3.2.1) > # start pyspark console by typing `pyspark` in the terminal > # Try to run the following SQL with `spark.sql(sql)` > > {code:java} > WITH mock_cte__users AS ( >SELECT 1 AS id >), >model_under_test AS ( > WITH users AS ( > SELECT * > FROM mock_cte__users > ) >SELECT * > FROM users >) > SELECT * > FROM model_under_test;{code} > Spark will fail with > > {code:java} > pyspark.sql.utils.AnalysisException: Table or view not found: > mock_cte__users; line 8 pos 29; {code} > I don't know if this is a regression or an expected behavior of the new 3.2.* > versions. This fix introduced in 3.2.0 might be related: > https://issues.apache.org/jira/browse/SPARK-36447 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40297) CTE outer reference nested in CTE main body cannot be resolved
[ https://issues.apache.org/jira/browse/SPARK-40297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40297. - Fix Version/s: 3.4.0 3.3.1 Assignee: Wei Xue Resolution: Fixed > CTE outer reference nested in CTE main body cannot be resolved > -- > > Key: SPARK-40297 > URL: https://issues.apache.org/jira/browse/SPARK-40297 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wei Xue >Assignee: Wei Xue >Priority: Minor > Fix For: 3.4.0, 3.3.1 > > > AnalysisException "Table or view not found" is thrown when a CTE reference > occurs in an inner CTE definition nested in the outer CTE's main body FROM > clause. E.g., > {code} > WITH cte_outer AS ( > SELECT 1 > ) > SELECT * FROM ( > WITH cte_inner AS ( > SELECT * FROM cte_outer > ) > SELECT * FROM cte_inner > ) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40351) Spark Sum increases the precision of DecimalType arguments by 10
[ https://issues.apache.org/jira/browse/SPARK-40351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600572#comment-17600572 ] Tymofii commented on SPARK-40351: - # Not sure I understood why you showed those lines of code. # For example, the issue we faced is as follows. Source tables have decimal fields with a certain precision defined during table creation. There are a number of queries which extract and transform the data from those source tables and load it into the target one, which also has a decimal field with the same precision as the source tables. So the users know for sure that summing values in the source decimal fields will not exceed the target table field precision. Currently they have to add an explicit cast after the SUM function to comply with the target table definition, since our ETL flow would fail otherwise. It is not very convenient if there are multiple queries. So they could, for example, disable the automatic increase of precision in this case. # Another question - what is the rationale behind the number 10? > Spark Sum increases the precision of DecimalType arguments by 10 > > > Key: SPARK-40351 > URL: https://issues.apache.org/jira/browse/SPARK-40351 > Project: Spark > Issue Type: Question > Components: Optimizer >Affects Versions: 3.2.0 >Reporter: Tymofii >Priority: Minor > > Currently in Spark automatically increases Decimal field by 10 (hard coded > value) after SUM aggregate operation - > [https://github.com/apache/spark/blob/branch-3.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1877.] > There are a couple of questions: > # Why was 10 chosen as default one? > # Does it make sense to allow the user to override this value via > configuration? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
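The explicit cast described above would look like this (table and column names are made up):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for a source table with a DECIMAL(10,2) column.
spark.range(3).selectExpr("CAST(id AS DECIMAL(10,2)) AS amount") \
    .createOrReplaceTempView("source_table")

# Restores the source precision after aggregation so the result matches a
# target column declared as DECIMAL(10,2); values that genuinely overflow
# become NULL (or raise an error under ANSI mode) instead of widening.
result = spark.sql(
    "SELECT CAST(SUM(amount) AS DECIMAL(10, 2)) AS amount_total FROM source_table"
)
result.printSchema()  # amount_total: decimal(10,2), back to the source precision
{code}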
[jira] [Resolved] (SPARK-39830) Reading ORC table that requires type promotion may throw AIOOBE
[ https://issues.apache.org/jira/browse/SPARK-39830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-39830. --- Fix Version/s: 3.4.0 Assignee: dzcxzl Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/37800 > Reading ORC table that requires type promotion may throw AIOOBE > --- > > Key: SPARK-39830 > URL: https://issues.apache.org/jira/browse/SPARK-39830 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.4.0 > > > We can add a UT to test the scenario after the ORC-1205 release. > > bin/spark-shell > {code:java} > spark.sql("set orc.stripe.size=10240") > spark.sql("set orc.rows.between.memory.checks=1") > spark.sql("set spark.sql.orc.columnarWriterBatchSize=1") > val df = spark.range(1, 1+512, 1, 1).map { i => > if( i == 1 ){ > (i, Array.fill[Byte](5 * 1024 * 1024)('X')) > } else { > (i,Array.fill[Byte](1)('X')) > } > }.toDF("c1","c2") > df.write.format("orc").save("file:///tmp/test_table_orc_t1") > spark.sql("create external table test_table_orc_t1 (c1 string ,c2 binary) > location 'file:///tmp/test_table_orc_t1' stored as orc ") > spark.sql("select * from test_table_orc_t1").show() {code} > Querying this table will get the following exception > {code:java} > java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387) > at > org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740) > at > org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069) > at > org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77) > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:84) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:102) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39830) Add a test case to read ORC table that requires type promotion
[ https://issues.apache.org/jira/browse/SPARK-39830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-39830: -- Summary: Add a test case to read ORC table that requires type promotion (was: Reading ORC table that requires type promotion may throw AIOOBE) > Add a test case to read ORC table that requires type promotion > -- > > Key: SPARK-39830 > URL: https://issues.apache.org/jira/browse/SPARK-39830 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.4.0 > > > We can add a UT to test the scenario after the ORC-1205 release. > > bin/spark-shell > {code:java} > spark.sql("set orc.stripe.size=10240") > spark.sql("set orc.rows.between.memory.checks=1") > spark.sql("set spark.sql.orc.columnarWriterBatchSize=1") > val df = spark.range(1, 1+512, 1, 1).map { i => > if( i == 1 ){ > (i, Array.fill[Byte](5 * 1024 * 1024)('X')) > } else { > (i,Array.fill[Byte](1)('X')) > } > }.toDF("c1","c2") > df.write.format("orc").save("file:///tmp/test_table_orc_t1") > spark.sql("create external table test_table_orc_t1 (c1 string ,c2 binary) > location 'file:///tmp/test_table_orc_t1' stored as orc ") > spark.sql("select * from test_table_orc_t1").show() {code} > Querying this table will get the following exception > {code:java} > java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387) > at > org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740) > at > org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069) > at > org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77) > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:84) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:102) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39830) Reading ORC table that requires type promotion may throw AIOOBE
[ https://issues.apache.org/jira/browse/SPARK-39830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-39830: -- Component/s: Tests > Reading ORC table that requires type promotion may throw AIOOBE > --- > > Key: SPARK-39830 > URL: https://issues.apache.org/jira/browse/SPARK-39830 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.4.0 > > > We can add a UT to test the scenario after the ORC-1205 release. > > bin/spark-shell > {code:java} > spark.sql("set orc.stripe.size=10240") > spark.sql("set orc.rows.between.memory.checks=1") > spark.sql("set spark.sql.orc.columnarWriterBatchSize=1") > val df = spark.range(1, 1+512, 1, 1).map { i => > if( i == 1 ){ > (i, Array.fill[Byte](5 * 1024 * 1024)('X')) > } else { > (i,Array.fill[Byte](1)('X')) > } > }.toDF("c1","c2") > df.write.format("orc").save("file:///tmp/test_table_orc_t1") > spark.sql("create external table test_table_orc_t1 (c1 string ,c2 binary) > location 'file:///tmp/test_table_orc_t1' stored as orc ") > spark.sql("select * from test_table_orc_t1").show() {code} > Querying this table will get the following exception > {code:java} > java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387) > at > org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740) > at > org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069) > at > org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77) > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:84) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:102) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org