[jira] [Updated] (SPARK-40161) Make Series.mode apply PandasMode
[ https://issues.apache.org/jira/browse/SPARK-40161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng updated SPARK-40161:
----------------------------------
    Parent: SPARK-40327
    Issue Type: Sub-task  (was: Improvement)

> Make Series.mode apply PandasMode
> ---------------------------------
>
> Key: SPARK-40161
> URL: https://issues.apache.org/jira/browse/SPARK-40161
> Project: Spark
> Issue Type: Sub-task
> Components: ps
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Minor
> Fix For: 3.4.0
[jira] [Updated] (SPARK-40138) Implement DataFrame.mode
[ https://issues.apache.org/jira/browse/SPARK-40138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng updated SPARK-40138:
----------------------------------
    Parent: SPARK-40327
    Issue Type: Sub-task  (was: Improvement)

> Implement DataFrame.mode
> ------------------------
>
> Key: SPARK-40138
> URL: https://issues.apache.org/jira/browse/SPARK-40138
> Project: Spark
> Issue Type: Sub-task
> Components: ps
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Fix For: 3.4.0
[jira] [Assigned] (SPARK-40333) Implement `GroupBy.nth`.
[ https://issues.apache.org/jira/browse/SPARK-40333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng reassigned SPARK-40333:
-------------------------------------
    Assignee: Ruifeng Zheng

> Implement `GroupBy.nth`.
> ------------------------
>
> Key: SPARK-40333
> URL: https://issues.apache.org/jira/browse/SPARK-40333
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Assignee: Ruifeng Zheng
> Priority: Major
>
> We should implement `GroupBy.nth` to increase pandas API coverage.
> pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.nth.html
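For context, a minimal pure-pandas sketch of the semantics this sub-task targets (illustration only, not the pandas-on-Spark implementation):

{code:python}
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 1, 2], "B": [None, 2.0, 3.0, 4.0, 5.0]})
g = df.groupby("A")

# nth(0) takes the first row of each group; unlike first(), it does not skip NaN.
print(g.nth(0))
# nth(1) takes the second row; groups that lack a second row are omitted.
print(g.nth(1))
# Negative indices count from the end of each group.
print(g.nth(-1))
{code}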
[jira] [Updated] (SPARK-40313) ps.DataFrame(data, index) should support the same anchor
[ https://issues.apache.org/jira/browse/SPARK-40313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng updated SPARK-40313:
----------------------------------
    Parent: SPARK-40327
    Issue Type: Sub-task  (was: Improvement)

> ps.DataFrame(data, index) should support the same anchor
> ---------------------------------------------------------
>
> Key: SPARK-40313
> URL: https://issues.apache.org/jira/browse/SPARK-40313
> Project: Spark
> Issue Type: Sub-task
> Components: ps
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Priority: Major
[jira] [Updated] (SPARK-40135) Support ps.Index in DataFrame creation
[ https://issues.apache.org/jira/browse/SPARK-40135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng updated SPARK-40135:
----------------------------------
    Parent: SPARK-40327
    Issue Type: Sub-task  (was: Improvement)

> Support ps.Index in DataFrame creation
> --------------------------------------
>
> Key: SPARK-40135
> URL: https://issues.apache.org/jira/browse/SPARK-40135
> Project: Spark
> Issue Type: Sub-task
> Components: ps
> Affects Versions: 3.4.0
> Reporter: Ruifeng Zheng
> Assignee: Ruifeng Zheng
> Priority: Major
> Fix For: 3.4.0
[jira] [Commented] (SPARK-40327) Increase pandas API coverage for pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-40327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600242#comment-17600242 ]

Ruifeng Zheng commented on SPARK-40327:
---------------------------------------
also cc [~dc-heros] [~dchvn]

> Increase pandas API coverage for pandas API on Spark
> ----------------------------------------------------
>
> Key: SPARK-40327
> URL: https://issues.apache.org/jira/browse/SPARK-40327
> Project: Spark
> Issue Type: Umbrella
> Components: Pandas API on Spark
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Priority: Major
>
> Increasing the pandas API coverage for Apache Spark 3.4.0.
[jira] [Assigned] (SPARK-40333) Implement `GroupBy.nth`.
[ https://issues.apache.org/jira/browse/SPARK-40333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40333:
------------------------------------
    Assignee: Apache Spark  (was: Ruifeng Zheng)
[jira] [Commented] (SPARK-40333) Implement `GroupBy.nth`.
[ https://issues.apache.org/jira/browse/SPARK-40333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600268#comment-17600268 ]

Apache Spark commented on SPARK-40333:
--------------------------------------
User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37801
[jira] [Assigned] (SPARK-40333) Implement `GroupBy.nth`.
[ https://issues.apache.org/jira/browse/SPARK-40333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40333:
------------------------------------
    Assignee: Ruifeng Zheng  (was: Apache Spark)
[jira] [Updated] (SPARK-40291) Improve the message for column not in group by clause error
[ https://issues.apache.org/jira/browse/SPARK-40291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Max Gekk updated SPARK-40291:
-----------------------------
    Parent: SPARK-37935
    Issue Type: Sub-task  (was: Task)

> Improve the message for column not in group by clause error
> ------------------------------------------------------------
>
> Key: SPARK-40291
> URL: https://issues.apache.org/jira/browse/SPARK-40291
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Linhong Liu
> Priority: Major
>
> Improve the message for the "column not in group by clause" error so that it uses the new error framework.
[jira] [Commented] (SPARK-39995) PySpark installation doesn't support Scala 2.13 binaries
[ https://issues.apache.org/jira/browse/SPARK-39995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600297#comment-17600297 ]

Oleksandr Shevchenko commented on SPARK-39995:
----------------------------------------------
It definitely matters. It affects which dependencies/packages we can use (e.g. DataSourceV2 API implementations for read and write), and it impacts DX (developer experience) and installation, including the CD process for our code.

> PySpark installation doesn't support Scala 2.13 binaries
> ---------------------------------------------------------
>
> Key: SPARK-39995
> URL: https://issues.apache.org/jira/browse/SPARK-39995
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.3.0
> Reporter: Oleksandr Shevchenko
> Priority: Major
>
> [PyPi|https://pypi.org/project/pyspark/] doesn't support Spark binary [installation|https://spark.apache.org/docs/latest/api/python/getting_started/install.html#using-pypi] for Scala 2.13.
> Currently, the setup [script|https://github.com/apache/spark/blob/master/python/pyspark/install.py] allows setting the Spark and Hadoop versions (PYSPARK_HADOOP_VERSION) and the mirror (PYSPARK_RELEASE_MIRROR) used to download the Spark binaries, but the download is always the Scala 2.12 compatible binaries. There is no parameter to download "spark-3.3.0-bin-hadoop3-scala2.13.tgz".
> It's possible to download Spark manually and set the needed SPARK_HOME, but that is hard to use with pip or Poetry.
> Also, env vars (e.g. PYSPARK_HADOOP_VERSION) are easy to use with pip and the CLI, but not possible with package managers like Poetry.
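For context, a minimal sketch of how the documented install knob is driven today; the PYSPARK_SCALA_VERSION knob in the comment is hypothetical and is exactly what this issue says is missing:

{code:python}
import os
import subprocess
import sys

# PYSPARK_HADOOP_VERSION is a documented knob for the PyPI install; "3" selects
# the Hadoop 3 build. A hypothetical PYSPARK_SCALA_VERSION=2.13 knob, which
# would select spark-3.3.0-bin-hadoop3-scala2.13.tgz, does not exist today.
env = dict(os.environ, PYSPARK_HADOOP_VERSION="3")
subprocess.check_call([sys.executable, "-m", "pip", "install", "pyspark"], env=env)
{code}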
[jira] [Created] (SPARK-40350) Improve the configuration of volcano scheduler
Sun BiaoBiao created SPARK-40350:
---------------------------------
    Summary: Improve the configuration of volcano scheduler
    Key: SPARK-40350
    URL: https://issues.apache.org/jira/browse/SPARK-40350
    Project: Spark
    Issue Type: Improvement
    Components: Kubernetes
    Affects Versions: 3.3.0
    Reporter: Sun BiaoBiao

Now that we use volcano as our scheduler, we need to specify the following configuration options:

{code:java}
spark.kubernetes.scheduler.name=volcano
spark.kubernetes.scheduler.volcano.podGroupTemplateFile=/path/to/podgroup-template.yaml
spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep
spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep
{code}

and we have to use a configMap to mount the /path/to/podgroup-template.yaml file.

If we could use spark config to specify the parameters of the podgroup, it would be much more convenient: we would not need a configMap to mount static files.
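A sketch of what the proposal could look like from the application side. The podGroup.* keys below are hypothetical (they do not exist in Spark today) and only illustrate the shape of the idea:

{code:python}
from pyspark import SparkConf

# Existing keys, as listed in the issue description:
conf = (
    SparkConf()
    .set("spark.kubernetes.scheduler.name", "volcano")
    .set("spark.kubernetes.driver.pod.featureSteps",
         "org.apache.spark.deploy.k8s.features.VolcanoFeatureStep")
    .set("spark.kubernetes.executor.pod.featureSteps",
         "org.apache.spark.deploy.k8s.features.VolcanoFeatureStep")
    # Hypothetical keys proposed here; today these values must live in the
    # podgroup-template.yaml mounted via a configMap:
    .set("spark.kubernetes.scheduler.volcano.podGroup.queue", "my-queue")
    .set("spark.kubernetes.scheduler.volcano.podGroup.minMember", "1")
)
{code}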
[jira] [Updated] (SPARK-40350) Improve the configuration of volcano scheduler
[ https://issues.apache.org/jira/browse/SPARK-40350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sun BiaoBiao updated SPARK-40350:
---------------------------------
    Description: the closing sentence now reads "If we use spark config to specify the parameters of the podgroup, it will be much more convenient, we don't need configmap to mount static files"
    was: "... it will be much more convenient, so that we don't need configmap to mount static files"
[jira] [Updated] (SPARK-40350) Improve the configuration of volcano scheduler
[ https://issues.apache.org/jira/browse/SPARK-40350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sun BiaoBiao updated SPARK-40350:
---------------------------------
    Description: the closing sentence now reads "... it will be much more convenient, so that we don't need configmap to mount static files"
    was: "... it will be much more convenient, so that we can mount static files without using configmap"
[jira] [Updated] (SPARK-40350) Improve the configuration of volcano scheduler
[ https://issues.apache.org/jira/browse/SPARK-40350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sun BiaoBiao updated SPARK-40350:
---------------------------------
    Description:
Now we use volcano as our scheduler, we need to specify the following configuration options:
{code:java}
spark.kubernetes.scheduler.name=volcano
spark.kubernetes.scheduler.volcano.podGroupTemplateFile=/path/to/podgroup-template.yaml
spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep
spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep
{code}
we should use configMap to mount the /path/to/podgroup-template.yaml file.
If we use spark config to specify the parameters of the podgroup, it will be much more convenient, we don't need configmap to mount static files.
In our scenario, we need to dynamically specify the volcano queue, but it is not convenient to create a static podgroup configuration file to mount.

    was: the same text without the final sentence about dynamically specifying the volcano queue
[jira] [Updated] (SPARK-40350) Use spark config to configure the parameters of volcano podgroup
[ https://issues.apache.org/jira/browse/SPARK-40350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sun BiaoBiao updated SPARK-40350:
---------------------------------
    Summary: Use spark config to configure the parameters of volcano podgroup  (was: Improve the configuration of volcano scheduler)
[jira] [Commented] (SPARK-40350) Use spark config to configure the parameters of volcano podgroup
[ https://issues.apache.org/jira/browse/SPARK-40350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600306#comment-17600306 ]

Apache Spark commented on SPARK-40350:
--------------------------------------
User 'zheniantoushipashi' has created a pull request for this issue:
https://github.com/apache/spark/pull/37802
[jira] [Assigned] (SPARK-40350) Use spark config to configure the parameters of volcano podgroup
[ https://issues.apache.org/jira/browse/SPARK-40350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40350:
------------------------------------
    Assignee: Apache Spark
[jira] [Assigned] (SPARK-40350) Use spark config to configure the parameters of volcano podgroup
[ https://issues.apache.org/jira/browse/SPARK-40350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40350:
------------------------------------
    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-39546) Respect port definitions on K8S pod templates for both driver and executor
[ https://issues.apache.org/jira/browse/SPARK-39546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600324#comment-17600324 ]

Apache Spark commented on SPARK-39546:
--------------------------------------
User 'fanyilun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37803

> Respect port definitions on K8S pod templates for both driver and executor
> ---------------------------------------------------------------------------
>
> Key: SPARK-39546
> URL: https://issues.apache.org/jira/browse/SPARK-39546
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes
> Affects Versions: 3.3.0
> Reporter: Oliver Koeth
> Priority: Minor
>
> *Description:*
> Spark on K8S allows opening additional ports for custom purposes on the driver pod via the pod template, but it ignores the port specification in the executor pod template. Port specifications from the pod template should be preserved (and extended) for both drivers and executors.
> *Scenario:*
> I want to run functionality in the executor that exposes data on an additional port. In my case, this is monitoring data exposed by Spark's JMX metrics sink via the JMX prometheus exporter java agent https://github.com/prometheus/jmx_exporter -- the java agent opens an extra port inside the container, but for prometheus to detect and scrape the port, it must be exposed in the K8S pod resource.
> (More background if desired: this seems to be the "classic" Spark 2 way to expose prometheus metrics. Spark 3 introduced a native equivalent servlet for the driver, but for the executor, only a rather limited set of metrics is forwarded via the driver, and that also follows a completely different naming scheme. So the JMX + exporter approach still turns out to be more useful for me, even in Spark 3.)
> Expected behavior:
> I add the following to my pod template to expose the extra port opened by the JMX exporter java agent:
> spec:
>   containers:
>     - ...
>       ports:
>         - containerPort: 8090
>           name: jmx-prometheus
>           protocol: TCP
> Observed behavior:
> The port is exposed for driver pods but not for executor pods.
> *Corresponding code:*
> Driver pod creation just adds ports
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala]
> (currently line 115)
> val driverContainer = new ContainerBuilder(pod.container)
>   ...
>   .addNewPort()
>   ...
>   .addNewPort()
> while executor pod creation replaces the ports
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala]
> (currently line 211)
> val executorContainer = new ContainerBuilder(pod.container)
>   ...
>   .withPorts(requiredPorts.asJava)
> The current handling is inconsistent and unnecessarily limiting. It seems that the executor creation could/should just as well preserve ports from the template and add the extra required ports.
> *Workaround:*
> It is possible to work around this limitation by adding a full sidecar container to the executor pod spec which declares the port. Sidecar containers are left unchanged by pod template handling.
> As all containers in a pod share the same network, it does not matter which container actually declares to expose the port.
[jira] [Assigned] (SPARK-39546) Respect port definitions on K8S pod templates for both driver and executor
[ https://issues.apache.org/jira/browse/SPARK-39546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39546:
------------------------------------
    Assignee: Apache Spark
[jira] [Assigned] (SPARK-39546) Respect port definitions on K8S pod templates for both driver and executor
[ https://issues.apache.org/jira/browse/SPARK-39546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39546:
------------------------------------
    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-39546) Respect port definitions on K8S pod templates for both driver and executor
[ https://issues.apache.org/jira/browse/SPARK-39546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600346#comment-17600346 ]

Yilun Fan commented on SPARK-39546:
-----------------------------------
I made a PR which I think can resolve this issue.
[jira] [Assigned] (SPARK-40335) Implement `DataFrameGroupBy.corr`.
[ https://issues.apache.org/jira/browse/SPARK-40335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng reassigned SPARK-40335:
-------------------------------------
    Assignee: Ruifeng Zheng

> Implement `DataFrameGroupBy.corr`.
> ----------------------------------
>
> Key: SPARK-40335
> URL: https://issues.apache.org/jira/browse/SPARK-40335
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Assignee: Ruifeng Zheng
> Priority: Major
>
> We should implement `DataFrameGroupBy.corr` to increase pandas API coverage.
> pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.corr.html
[jira] [Assigned] (SPARK-40336) Implement `DataFrameGroupBy.cov`.
[ https://issues.apache.org/jira/browse/SPARK-40336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng reassigned SPARK-40336:
-------------------------------------
    Assignee: Ruifeng Zheng

> Implement `DataFrameGroupBy.cov`.
> ---------------------------------
>
> Key: SPARK-40336
> URL: https://issues.apache.org/jira/browse/SPARK-40336
> Project: Spark
> Issue Type: Sub-task
> Components: Pandas API on Spark
> Affects Versions: 3.4.0
> Reporter: Haejoon Lee
> Assignee: Ruifeng Zheng
> Priority: Major
>
> We should implement `DataFrameGroupBy.cov` to increase pandas API coverage.
> pandas docs: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.cov.html
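For context, a minimal pure-pandas sketch of the corr/cov semantics these two sub-tasks target (illustration only, not the pandas-on-Spark implementation):

{code:python}
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "a", "b", "b", "b"],
    "x": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
    "y": [2.0, 4.0, 6.0, 5.0, 6.0, 7.0],
})

# Pairwise correlation / covariance of the numeric columns within each group.
print(df.groupby("key").corr())
print(df.groupby("key").cov())
{code}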
[jira] [Commented] (SPARK-40322) Fix all dead links
[ https://issues.apache.org/jira/browse/SPARK-40322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600367#comment-17600367 ]

Yang Jie commented on SPARK-40322:
----------------------------------
The links related to `Spark Summit` have now been redirected to https://www.databricks.com/dataaisummit/. Is it better to keep the links, or to remove the links and only keep the text?

> Fix all dead links
> ------------------
>
> Key: SPARK-40322
> URL: https://issues.apache.org/jira/browse/SPARK-40322
> Project: Spark
> Issue Type: Bug
> Components: Documentation
> Affects Versions: 3.4.0
> Reporter: Yuming Wang
> Priority: Major
>
> https://www.deadlinkchecker.com/website-dead-link-checker.asp
>
> ||Status||URL||Source link text||
> |-1 Not found: The server name or address could not be resolved|[http://engineering.ooyala.com/blog/using-parquet-and-scrooge-spark]|[Using Parquet and Scrooge with Spark|https://spark.apache.org/documentation.html]|
> |-1 Not found: The server name or address could not be resolved|[http://blinkdb.org/]|[BlinkDB|https://spark.apache.org/third-party-projects.html]|
> |404 Not Found|[https://github.com/AyasdiOpenSource/df]|[DF|https://spark.apache.org/third-party-projects.html]|
> |-1 Timeout|[https://atp.io/]|[atp|https://spark.apache.org/powered-by.html]|
> |-1 Not found: The server name or address could not be resolved|[http://www.sehir.edu.tr/en/]|[Istanbul Sehir University|https://spark.apache.org/powered-by.html]|
> |404 Not Found|[http://nsn.com/]|[Nokia Solutions and Networks|https://spark.apache.org/powered-by.html]|
> |-1 Not found: The server name or address could not be resolved|[http://www.nubetech.co/]|[Nube Technologies|https://spark.apache.org/powered-by.html]|
> |-1 Timeout|[http://ooyala.com/]|[Ooyala, Inc.|https://spark.apache.org/powered-by.html]|
> |-1 Not found: The server name or address could not be resolved|[http://engineering.ooyala.com/blog/fast-spark-queries-memory-datasets]|[Spark for Fast Queries|https://spark.apache.org/powered-by.html]|
> |-1 Not found: The server name or address could not be resolved|[http://www.sisa.samsung.com/]|[Samsung Research America|https://spark.apache.org/powered-by.html]|
> |-1 Timeout|[https://checker.apache.org/projs/spark.html]|[https://checker.apache.org/projs/spark.html|https://spark.apache.org/release-process.html]|
> |404 Not Found|[https://ampcamp.berkeley.edu/amp-camp-two-strata-2013/]|[AMP Camp 2 [302 from http://ampcamp.berkeley.edu/amp-camp-two-strata-2013/]|https://spark.apache.org/documentation.html]|
> |404 Not Found|[https://ampcamp.berkeley.edu/agenda-2012/]|[AMP Camp 1 [302 from http://ampcamp.berkeley.edu/agenda-2012/]|https://spark.apache.org/documentation.html]|
> |404 Not Found|[https://ampcamp.berkeley.edu/4/]|[AMP Camp 4 [302 from http://ampcamp.berkeley.edu/4/]|https://spark.apache.org/documentation.html]|
> |404 Not Found|[https://ampcamp.berkeley.edu/3/]|[AMP Camp 3 [302 from http://ampcamp.berkeley.edu/3/]|https://spark.apache.org/documentation.html]|
> |500 Internal Server Error|[https://www.packtpub.com/product/spark-cookbook/9781783987061]|[Spark Cookbook [301 from https://www.packtpub.com/big-data-and-business-intelligence/spark-cookbook]|https://spark.apache.org/documentation.html]|
> |500 Internal Server Error|[https://www.packtpub.com/product/apache-spark-graph-processing/9781784391805]|[Apache Spark Graph Processing [301 from https://www.packtpub.com/big-data-and-business-intelligence/apache-spark-graph-processing]|https://spark.apache.org/documentation.html]|
> |500 Internal Server Error|[https://prevalentdesignevents.com/sparksummit/eu17/]|[register|https://spark.apache.org/news/]|
> |500 Internal Server Error|[https://prevalentdesignevents.com/sparksummit/ss17/?_ga=1.211902866.780052874.1433437196]|[register|https://spark.apache.org/news/]|
> |500 Internal Server Error|[https://www.prevalentdesignevents.com/sparksummit2015/europe/registration.aspx?source=header]|[register|https://spark.apache.org/news/]|
> |500 Internal Server Error|[https://www.prevalentdesignevents.com/sparksummit2015/europe/speaker/]|[Spark Summit Europe|https://spark.apache.org/news/]|
> |-1 Timeout|[http://strataconf.com/strata2013]|[Strata|https://spark.apache.org/news/]|
> |-1 Not found: The server name or address could not be resolved|[http://blog.quantifind.com/posts/spark-unit-test/]|[Unit testing with Spark|https://spark.apache.org/news/]|
> |-1 Not found: The server name or address could not be resolved|[http://blog.quantifind.com/posts/logging-post/]|[Configuring Spark's logs|https://spark.apache.org/news/]|
> |-1 Timeout|[http://strata.oreilly.com/2012/08/seven-reasons-why-i-like-spark.html]|[Spark|https://spark.apache.org/news/]|
> |-1 Timeout|[http://strata.oreilly.com/2012/11/shark-real-time-queries-and-analytics-for-big-data.html]|[Shark|https://spark.apache.org/news/]|
[jira] [Created] (SPARK-40351) Spark Sum increases the precision of DecimalType arguments by 10
Tymofii created SPARK-40351:
----------------------------
    Summary: Spark Sum increases the precision of DecimalType arguments by 10
    Key: SPARK-40351
    URL: https://issues.apache.org/jira/browse/SPARK-40351
    Project: Spark
    Issue Type: Question
    Components: Optimizer
    Affects Versions: 3.2.0
    Reporter: Tymofii

Currently, Spark automatically increases the precision of a Decimal field by 10 (a hard-coded value) for the result of the SUM aggregate operation -- see https://github.com/apache/spark/blob/branch-3.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1877.

There are a couple of questions:
# Why was 10 chosen as the default?
# Does it make sense to allow the user to override this value via configuration?
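A minimal PySpark reproduction of the behavior being asked about: summing a DecimalType(10, 2) column yields decimal(20, 2), the input precision widened by 10.

{code:python}
from decimal import Decimal

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField("amount", DecimalType(10, 2))])
df = spark.createDataFrame([(Decimal("1.23"),), (Decimal("4.56"),)], schema)

# The precision is widened by the hard-coded 10 to reduce the risk of
# overflow when aggregating many rows.
df.select(F.sum("amount")).printSchema()
# root
#  |-- sum(amount): decimal(20,2) (nullable = true)
{code}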
[jira] [Commented] (SPARK-40322) Fix all dead links
[ https://issues.apache.org/jira/browse/SPARK-40322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600370#comment-17600370 ]

Yang Jie commented on SPARK-40322:
----------------------------------
Many historical links on the news page are no longer accessible.
[jira] [Commented] (SPARK-40322) Fix all dead links
[ https://issues.apache.org/jira/browse/SPARK-40322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600384#comment-17600384 ] Yang Jie commented on SPARK-40322: -- [https://www.packtpub.com/big-data-and-business-intelligence/spark-cookbook] and [https://www.packtpub.com/big-data-and-business-intelligence/apache-spark-graph-processing] not dead links > Fix all dead links > -- > > Key: SPARK-40322 > URL: https://issues.apache.org/jira/browse/SPARK-40322 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > > https://www.deadlinkchecker.com/website-dead-link-checker.asp > > > ||Status||URL||Source link text|| > |-1 Not found: The server name or address could not be > resolved|[http://engineering.ooyala.com/blog/using-parquet-and-scrooge-spark]|[Using > Parquet and Scrooge with Spark|https://spark.apache.org/documentation.html]| > |-1 Not found: The server name or address could not be > resolved|[http://blinkdb.org/]|[BlinkDB|https://spark.apache.org/third-party-projects.html]| > |404 Not > Found|[https://github.com/AyasdiOpenSource/df]|[DF|https://spark.apache.org/third-party-projects.html]| > |-1 Timeout|[https://atp.io/]|[atp|https://spark.apache.org/powered-by.html]| > |-1 Not found: The server name or address could not be > resolved|[http://www.sehir.edu.tr/en/]|[Istanbul Sehir > University|https://spark.apache.org/powered-by.html]| > |404 Not Found|[http://nsn.com/]|[Nokia Solutions and > Networks|https://spark.apache.org/powered-by.html]| > |-1 Not found: The server name or address could not be > resolved|[http://www.nubetech.co/]|[Nube > Technologies|https://spark.apache.org/powered-by.html]| > |-1 Timeout|[http://ooyala.com/]|[Ooyala, > Inc.|https://spark.apache.org/powered-by.html]| > |-1 Not found: The server name or address could not be > resolved|[http://engineering.ooyala.com/blog/fast-spark-queries-memory-datasets]|[Spark > for Fast Queries|https://spark.apache.org/powered-by.html]| > |-1 Not found: The server name or address could not be > resolved|[http://www.sisa.samsung.com/]|[Samsung Research > America|https://spark.apache.org/powered-by.html]| > |-1 > Timeout|[https://checker.apache.org/projs/spark.html]|[https://checker.apache.org/projs/spark.html|https://spark.apache.org/release-process.html]| > |404 Not Found|[https://ampcamp.berkeley.edu/amp-camp-two-strata-2013/]|[AMP > Camp 2 [302 from > http://ampcamp.berkeley.edu/amp-camp-two-strata-2013/]|https://spark.apache.org/documentation.html]| > |404 Not Found|[https://ampcamp.berkeley.edu/agenda-2012/]|[AMP Camp 1 [302 > from > http://ampcamp.berkeley.edu/agenda-2012/]|https://spark.apache.org/documentation.html]| > |404 Not Found|[https://ampcamp.berkeley.edu/4/]|[AMP Camp 4 [302 from > http://ampcamp.berkeley.edu/4/]|https://spark.apache.org/documentation.html]| > |404 Not Found|[https://ampcamp.berkeley.edu/3/]|[AMP Camp 3 [302 from > http://ampcamp.berkeley.edu/3/]|https://spark.apache.org/documentation.html]| > |500 Internal Server > Error|[https://www.packtpub.com/product/spark-cookbook/9781783987061]|[Spark > Cookbook [301 from > https://www.packtpub.com/big-data-and-business-intelligence/spark-cookbook]|https://spark.apache.org/documentation.html]| > |500 Internal Server > Error|[https://www.packtpub.com/product/apache-spark-graph-processing/9781784391805]|[Apache > Spark Graph Processing [301 from > 
https://www.packtpub.com/big-data-and-business-intelligence/apache-spark-graph-processing]|https://spark.apache.org/documentation.html]| > |500 Internal Server > Error|[https://prevalentdesignevents.com/sparksummit/eu17/]|[register|https://spark.apache.org/news/]| > |500 Internal Server > Error|[https://prevalentdesignevents.com/sparksummit/ss17/?_ga=1.211902866.780052874.1433437196]|[register|https://spark.apache.org/news/]| > |500 Internal Server > Error|[https://www.prevalentdesignevents.com/sparksummit2015/europe/registration.aspx?source=header]|[register|https://spark.apache.org/news/]| > |500 Internal Server > Error|[https://www.prevalentdesignevents.com/sparksummit2015/europe/speaker/]|[Spark > Summit Europe|https://spark.apache.org/news/]| > |-1 > Timeout|[http://strataconf.com/strata2013]|[Strata|https://spark.apache.org/news/]| > |-1 Not found: The server name or address could not be > resolved|[http://blog.quantifind.com/posts/spark-unit-test/]|[Unit testing > with Spark|https://spark.apache.org/news/]| > |-1 Not found: The server name or address could not be > resolved|[http://blog.quantifind.com/posts/logging-post/]|[Configuring > Spark's logs|https://spark.apache.org/news/]| > |-1 > Timeout|[http://strata.oreilly.com/2012/08/seven-reasons-why-i-like-spark.html]|[Spark|https://spark.apache.org/news/]| > |-1 > T
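For anyone re-verifying rows from the table above, a rough stand-in for the checker is sketched below. The ticket used deadlinkchecker.com, whose "-1" statuses correspond to DNS failures and timeouts rather than HTTP codes; the URLs and semantics here are only an approximation:

{code:python}
import requests

# A few of the URLs from the table above.
URLS = [
    "http://blinkdb.org/",
    "https://github.com/AyasdiOpenSource/df",
    "https://www.packtpub.com/big-data-and-business-intelligence/spark-cookbook",
]

for url in URLS:
    try:
        resp = requests.head(url, timeout=10, allow_redirects=True)
        print(resp.status_code, url)
    except requests.RequestException as exc:
        # Reported by the checker as "-1 Timeout" or "-1 Not found".
        print(f"-1 ({type(exc).__name__})", url)
{code}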
[jira] [Updated] (SPARK-40322) Fix all dead links
[ https://issues.apache.org/jira/browse/SPARK-40322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-40322: - Description: [https://www.deadlinkchecker.com/website-dead-link-checker.asp] ||Status||URL||Source link text|| |-1 Not found: The server name or address could not be resolved|[http://engineering.ooyala.com/blog/using-parquet-and-scrooge-spark]|[Using Parquet and Scrooge with Spark|https://spark.apache.org/documentation.html]| |-1 Not found: The server name or address could not be resolved|[http://blinkdb.org/]|[BlinkDB|https://spark.apache.org/third-party-projects.html]| |404 Not Found|[https://github.com/AyasdiOpenSource/df]|[DF|https://spark.apache.org/third-party-projects.html]| |-1 Timeout|[https://atp.io/]|[atp|https://spark.apache.org/powered-by.html]| |-1 Not found: The server name or address could not be resolved|[http://www.sehir.edu.tr/en/]|[Istanbul Sehir University|https://spark.apache.org/powered-by.html]| |404 Not Found|[http://nsn.com/]|[Nokia Solutions and Networks|https://spark.apache.org/powered-by.html]| |-1 Not found: The server name or address could not be resolved|[http://www.nubetech.co/]|[Nube Technologies|https://spark.apache.org/powered-by.html]| |-1 Timeout|[http://ooyala.com/]|[Ooyala, Inc.|https://spark.apache.org/powered-by.html]| |-1 Not found: The server name or address could not be resolved|[http://engineering.ooyala.com/blog/fast-spark-queries-memory-datasets]|[Spark for Fast Queries|https://spark.apache.org/powered-by.html]| |-1 Not found: The server name or address could not be resolved|[http://www.sisa.samsung.com/]|[Samsung Research America|https://spark.apache.org/powered-by.html]| |-1 Timeout|[https://checker.apache.org/projs/spark.html]|[https://checker.apache.org/projs/spark.html|https://spark.apache.org/release-process.html]| |404 Not Found|[https://ampcamp.berkeley.edu/amp-camp-two-strata-2013/]|[AMP Camp 2 [302 from http://ampcamp.berkeley.edu/amp-camp-two-strata-2013/]|https://spark.apache.org/documentation.html]| |404 Not Found|[https://ampcamp.berkeley.edu/agenda-2012/]|[AMP Camp 1 [302 from http://ampcamp.berkeley.edu/agenda-2012/]|https://spark.apache.org/documentation.html]| |404 Not Found|[https://ampcamp.berkeley.edu/4/]|[AMP Camp 4 [302 from http://ampcamp.berkeley.edu/4/]|https://spark.apache.org/documentation.html]| |404 Not Found|[https://ampcamp.berkeley.edu/3/]|[AMP Camp 3 [302 from http://ampcamp.berkeley.edu/3/]|https://spark.apache.org/documentation.html]| |-500 Internal Server Error-|-[https://www.packtpub.com/product/spark-cookbook/9781783987061]-|-[Spark Cookbook [301 from https://www.packtpub.com/big-data-and-business-intelligence/spark-cookbook]|https://spark.apache.org/documentation.html]-| |-500 Internal Server Error-|-[https://www.packtpub.com/product/apache-spark-graph-processing/9781784391805]-|-[Apache Spark Graph Processing [301 from https://www.packtpub.com/big-data-and-business-intelligence/apache-spark-graph-processing]|https://spark.apache.org/documentation.html]-| |500 Internal Server Error|[https://prevalentdesignevents.com/sparksummit/eu17/]|[register|https://spark.apache.org/news/]| |500 Internal Server Error|[https://prevalentdesignevents.com/sparksummit/ss17/?_ga=1.211902866.780052874.1433437196]|[register|https://spark.apache.org/news/]| |500 Internal Server Error|[https://www.prevalentdesignevents.com/sparksummit2015/europe/registration.aspx?source=header]|[register|https://spark.apache.org/news/]| |500 Internal Server 
Error|[https://www.prevalentdesignevents.com/sparksummit2015/europe/speaker/]|[Spark Summit Europe|https://spark.apache.org/news/]| |-1 Timeout|[http://strataconf.com/strata2013]|[Strata|https://spark.apache.org/news/]| |-1 Not found: The server name or address could not be resolved|[http://blog.quantifind.com/posts/spark-unit-test/]|[Unit testing with Spark|https://spark.apache.org/news/]| |-1 Not found: The server name or address could not be resolved|[http://blog.quantifind.com/posts/logging-post/]|[Configuring Spark's logs|https://spark.apache.org/news/]| |-1 Timeout|[http://strata.oreilly.com/2012/08/seven-reasons-why-i-like-spark.html]|[Spark|https://spark.apache.org/news/]| |-1 Timeout|[http://strata.oreilly.com/2012/11/shark-real-time-queries-and-analytics-for-big-data.html]|[Shark|https://spark.apache.org/news/]| |-1 Timeout|[http://strata.oreilly.com/2012/10/spark-0-6-improves-performance-and-accessibility.html]|[Spark 0.6 release|https://spark.apache.org/news/]| |404 Not Found|[http://data-informed.com/spark-an-open-source-engine-for-iterative-data-mining/]|[DataInformed|https://spark.apache.org/news/]| |-1 Timeout|[http://strataconf.com/strata2013/public/schedule/detail/27438]|[introduction to Spark, Shark and BDAS|https://spark.apache.org/news/]| |-1 Timeout|[http://strataconf.com/strata2013/public/schedule/detail/27440]|[hands-on exercise session|h
[jira] [Comment Edited] (SPARK-40322) Fix all dead links
[ https://issues.apache.org/jira/browse/SPARK-40322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600384#comment-17600384 ] Yang Jie edited comment on SPARK-40322 at 9/5/22 12:51 PM: --- [https://www.packtpub.com/big-data-and-business-intelligence/spark-cookbook] and [https://www.packtpub.com/big-data-and-business-intelligence/apache-spark-graph-processing] [https://www.packtpub.com/big-data-and-business-intelligence/big-data-analytics] not dead links was (Author: luciferyang): [https://www.packtpub.com/big-data-and-business-intelligence/spark-cookbook] and [https://www.packtpub.com/big-data-and-business-intelligence/apache-spark-graph-processing] not dead links > Fix all dead links > -- > > Key: SPARK-40322 > URL: https://issues.apache.org/jira/browse/SPARK-40322 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > > [https://www.deadlinkchecker.com/website-dead-link-checker.asp] > > > ||Status||URL||Source link text|| > |-1 Not found: The server name or address could not be > resolved|[http://engineering.ooyala.com/blog/using-parquet-and-scrooge-spark]|[Using > Parquet and Scrooge with Spark|https://spark.apache.org/documentation.html]| > |-1 Not found: The server name or address could not be > resolved|[http://blinkdb.org/]|[BlinkDB|https://spark.apache.org/third-party-projects.html]| > |404 Not > Found|[https://github.com/AyasdiOpenSource/df]|[DF|https://spark.apache.org/third-party-projects.html]| > |-1 Timeout|[https://atp.io/]|[atp|https://spark.apache.org/powered-by.html]| > |-1 Not found: The server name or address could not be > resolved|[http://www.sehir.edu.tr/en/]|[Istanbul Sehir > University|https://spark.apache.org/powered-by.html]| > |404 Not Found|[http://nsn.com/]|[Nokia Solutions and > Networks|https://spark.apache.org/powered-by.html]| > |-1 Not found: The server name or address could not be > resolved|[http://www.nubetech.co/]|[Nube > Technologies|https://spark.apache.org/powered-by.html]| > |-1 Timeout|[http://ooyala.com/]|[Ooyala, > Inc.|https://spark.apache.org/powered-by.html]| > |-1 Not found: The server name or address could not be > resolved|[http://engineering.ooyala.com/blog/fast-spark-queries-memory-datasets]|[Spark > for Fast Queries|https://spark.apache.org/powered-by.html]| > |-1 Not found: The server name or address could not be > resolved|[http://www.sisa.samsung.com/]|[Samsung Research > America|https://spark.apache.org/powered-by.html]| > |-1 > Timeout|[https://checker.apache.org/projs/spark.html]|[https://checker.apache.org/projs/spark.html|https://spark.apache.org/release-process.html]| > |404 Not Found|[https://ampcamp.berkeley.edu/amp-camp-two-strata-2013/]|[AMP > Camp 2 [302 from > http://ampcamp.berkeley.edu/amp-camp-two-strata-2013/]|https://spark.apache.org/documentation.html]| > |404 Not Found|[https://ampcamp.berkeley.edu/agenda-2012/]|[AMP Camp 1 [302 > from > http://ampcamp.berkeley.edu/agenda-2012/]|https://spark.apache.org/documentation.html]| > |404 Not Found|[https://ampcamp.berkeley.edu/4/]|[AMP Camp 4 [302 from > http://ampcamp.berkeley.edu/4/]|https://spark.apache.org/documentation.html]| > |404 Not Found|[https://ampcamp.berkeley.edu/3/]|[AMP Camp 3 [302 from > http://ampcamp.berkeley.edu/3/]|https://spark.apache.org/documentation.html]| > |-500 Internal Server > Error-|-[https://www.packtpub.com/product/spark-cookbook/9781783987061]-|-[Spark > Cookbook [301 from > 
https://www.packtpub.com/big-data-and-business-intelligence/spark-cookbook]|https://spark.apache.org/documentation.html]-| > |-500 Internal Server > Error-|-[https://www.packtpub.com/product/apache-spark-graph-processing/9781784391805]-|-[Apache > Spark Graph Processing [301 from > https://www.packtpub.com/big-data-and-business-intelligence/apache-spark-graph-processing]|https://spark.apache.org/documentation.html]-| > |500 Internal Server > Error|[https://prevalentdesignevents.com/sparksummit/eu17/]|[register|https://spark.apache.org/news/]| > |500 Internal Server > Error|[https://prevalentdesignevents.com/sparksummit/ss17/?_ga=1.211902866.780052874.1433437196]|[register|https://spark.apache.org/news/]| > |500 Internal Server > Error|[https://www.prevalentdesignevents.com/sparksummit2015/europe/registration.aspx?source=header]|[register|https://spark.apache.org/news/]| > |500 Internal Server > Error|[https://www.prevalentdesignevents.com/sparksummit2015/europe/speaker/]|[Spark > Summit Europe|https://spark.apache.org/news/]| > |-1 > Timeout|[http://strataconf.com/strata2013]|[Strata|https://spark.apache.org/news/]| > |-1 Not found: The server name or address could not be > resolved|[http://blog.quantifind.com/posts/spark-unit-test/]|[Unit testing
[jira] [Assigned] (SPARK-40352) Add function aliases: len, datepart, dateadd, date_diff and curdate
[ https://issues.apache.org/jira/browse/SPARK-40352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40352: Assignee: Max Gekk (was: Apache Spark) > Add function aliases: len, datepart, dateadd, date_diff and curdate > --- > > Key: SPARK-40352 > URL: https://issues.apache.org/jira/browse/SPARK-40352 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The functions len, datepart, dateadd, date_diff and curdate exist in other > systems, and Spark SQL has similar functions. So, adding such aliases will > make the migration to Spark SQL easier. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40352) Add function aliases: len, datepart, dateadd, date_diff and curdate
[ https://issues.apache.org/jira/browse/SPARK-40352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40352: Assignee: Apache Spark (was: Max Gekk) > Add function aliases: len, datepart, dateadd, date_diff and curdate > --- > > Key: SPARK-40352 > URL: https://issues.apache.org/jira/browse/SPARK-40352 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > The functions len, datepart, dateadd, date_diff and curdate exist in other > systems, and Spark SQL has similar functions. So, adding such aliases will > make the migration to Spark SQL easier. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40352) Add function aliases: len, datepart, dateadd, date_diff and curdate
Max Gekk created SPARK-40352: Summary: Add function aliases: len, datepart, dateadd, date_diff and curdate Key: SPARK-40352 URL: https://issues.apache.org/jira/browse/SPARK-40352 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Assignee: Max Gekk The functions len, datepart, dateadd, date_diff and curdate exist in other systems, and Spark SQL has similar functions. So, adding such aliases will make the migration to Spark SQL easier. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
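To make the intent concrete, here is a minimal sketch using the functions Spark SQL already ships; the new names are assumed to be pure aliases of these (exact signatures unverified here):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Existing spellings the aliases are assumed to map onto:
spark.sql("SELECT length('Spark')").show()   # len('Spark')
spark.sql("SELECT current_date()").show()    # curdate()
spark.sql("SELECT datediff(DATE'2022-09-05', DATE'2022-09-01')").show()  # date_diff(...)
{code}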
[jira] [Commented] (SPARK-40352) Add function aliases: len, datepart, dateadd, date_diff and curdate
[ https://issues.apache.org/jira/browse/SPARK-40352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600458#comment-17600458 ] Apache Spark commented on SPARK-40352: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37804 > Add function aliases: len, datepart, dateadd, date_diff and curdate > --- > > Key: SPARK-40352 > URL: https://issues.apache.org/jira/browse/SPARK-40352 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > The functions len, datepart, dateadd, date_diff and curdate exist in other > systems, and Spark SQL has similar functions. So, adding such aliases will > make the migration to Spark SQL easier. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.
[ https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600496#comment-17600496 ] Kyle Kent commented on SPARK-38004: --- [~itholic] I can create a PR for this. Should the change fit here in this function? https://github.com/apache/spark/blob/f9409ce7d49c25718317298031c84d1c8d6317af/python/pyspark/pandas/namespace.py#:~:text=internally.-,mangle_dupe_cols%20%3A%20bool%2C%20default%20True,are%20duplicate%20names%20in%20the%20columns.,-**kwds%20%3A%20optional I'm thinking of adding it as a note after the mangle_dup_col parameter. > read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns > but fails if the duplicate columns are case sensitive. > > > Key: SPARK-38004 > URL: https://issues.apache.org/jira/browse/SPARK-38004 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Saikrishna Pujari >Priority: Minor > > mangle_dupe_cols - default is True > So ideally it should have handled duplicate columns, but in case the columns > are case sensitive it fails as below. > AnalysisException: Reference '{{{}Sheet.col{}}}' is ambiguous, could be > Sheet.col, Sheet.col. > Where two columns are Col and cOL > In the best practices, there is a mention of not to use case sensitive > columns - > [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names] > Either the docs for read_excel/mangle_dupe_cols have to be updated about this > or it has to be handled. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
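For context, a minimal reproduction sketch of the reported behaviour (the file path and headers are made up; writing the file needs openpyxl):

{code:python}
import pandas as pd
import pyspark.pandas as ps

# Two headers that differ only by case; pandas keeps both as-is.
pd.DataFrame([[1, 2]], columns=["Col", "cOL"]).to_excel("/tmp/dup.xlsx", index=False)

# mangle_dupe_cols (default True) only renames exact duplicates, e.g.
# "col", "col.1". "Col" vs "cOL" collide only under Spark's case-insensitive
# analysis, so this is expected to fail with:
#   AnalysisException: Reference 'Sheet.col' is ambiguous ...
psdf = ps.read_excel("/tmp/dup.xlsx")
{code}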
[jira] [Created] (SPARK-40353) Re-enable the `read_excel` tests
Haejoon Lee created SPARK-40353: --- Summary: Re-enable the `read_excel` tests Key: SPARK-40353 URL: https://issues.apache.org/jira/browse/SPARK-40353 Project: Spark Issue Type: Bug Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Haejoon Lee So far, we've been skipping the `read_excel` test in pandas API on Spark: https://github.com/apache/spark/blob/6d2ce128058b439094cd1dd54253372af6977e79/python/pyspark/pandas/tests/test_dataframe_spark_io.py#L251 In https://github.com/apache/spark/pull/37671, we installing `openpyxl==3.0.10` to re-enable the `read_excel` tests, but it's still failing for some reason (Please see https://github.com/apache/spark/pull/37671#issuecomment-1237515485 for more detail). We should re-enable this test for improving the pandas-on-Spark test coverage. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
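The skip the ticket refers to looks roughly like the pattern below (a sketch only; the real guard lives in the test_dataframe_spark_io.py file linked above):

{code:python}
import unittest

try:
    import openpyxl  # noqa: F401  (the engine pandas' read_excel uses for .xlsx)
    have_openpyxl = True
except ImportError:
    have_openpyxl = False

class DataFrameSparkIOTests(unittest.TestCase):
    @unittest.skipIf(not have_openpyxl, "openpyxl is required for read_excel")
    def test_read_excel(self):
        ...
{code}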
[jira] [Updated] (SPARK-40353) Re-enable the `read_excel` tests
[ https://issues.apache.org/jira/browse/SPARK-40353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-40353: Description: So far, we've been skipping the `read_excel` test in pandas API on Spark: https://github.com/apache/spark/blob/6d2ce128058b439094cd1dd54253372af6977e79/python/pyspark/pandas/tests/test_dataframe_spark_io.py#L251 In https://github.com/apache/spark/pull/37671, we installed `openpyxl==3.0.10` to re-enable the `read_excel` tests, but it's still failing for some reason (Please see https://github.com/apache/spark/pull/37671#issuecomment-1237515485 for more detail). We should re-enable this test for improving the pandas-on-Spark test coverage. was: So far, we've been skipping the `read_excel` test in pandas API on Spark: https://github.com/apache/spark/blob/6d2ce128058b439094cd1dd54253372af6977e79/python/pyspark/pandas/tests/test_dataframe_spark_io.py#L251 In https://github.com/apache/spark/pull/37671, we installing `openpyxl==3.0.10` to re-enable the `read_excel` tests, but it's still failing for some reason (Please see https://github.com/apache/spark/pull/37671#issuecomment-1237515485 for more detail). We should re-enable this test for improving the pandas-on-Spark test coverage. > Re-enable the `read_excel` tests > > > Key: SPARK-40353 > URL: https://issues.apache.org/jira/browse/SPARK-40353 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > So far, we've been skipping the `read_excel` test in pandas API on Spark: > https://github.com/apache/spark/blob/6d2ce128058b439094cd1dd54253372af6977e79/python/pyspark/pandas/tests/test_dataframe_spark_io.py#L251 > In https://github.com/apache/spark/pull/37671, we installed > `openpyxl==3.0.10` to re-enable the `read_excel` tests, but it's still > failing for some reason (Please see > https://github.com/apache/spark/pull/37671#issuecomment-1237515485 for more > detail). > We should re-enable this test for improving the pandas-on-Spark test coverage. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38004) read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns but fails if the duplicate columns are case sensitive.
[ https://issues.apache.org/jira/browse/SPARK-38004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600528#comment-17600528 ] Haejoon Lee commented on SPARK-38004: - [~kentkr] Yes, I think adding a note for the parameter looks good enough for now. Please go ahead to create a PR and ping me. I'm willing to review for this :) > read_excel's parameter - mangle_dupe_cols is used to handle duplicate columns > but fails if the duplicate columns are case sensitive. > > > Key: SPARK-38004 > URL: https://issues.apache.org/jira/browse/SPARK-38004 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Saikrishna Pujari >Priority: Minor > > mangle_dupe_cols - default is True > So ideally it should have handled duplicate columns, but in case the columns > are case sensitive it fails as below. > AnalysisException: Reference '{{{}Sheet.col{}}}' is ambiguous, could be > Sheet.col, Sheet.col. > Where two columns are Col and cOL > In the best practices, there is a mention of not to use case sensitive > columns - > [https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#do-not-use-duplicated-column-names] > Either the docs for read_excel/mangle_dupe_cols have to be updated about this > or it has to be handled. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36999) Document the command ALTER TABLE RECOVER PARTITIONS
[ https://issues.apache.org/jira/browse/SPARK-36999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600530#comment-17600530 ] Rajanikant Vellaturi commented on SPARK-36999: -- Hi [~maxgekk] , Can I work on this? Please let me know. Thanks > Document the command ALTER TABLE RECOVER PARTITIONS > --- > > Key: SPARK-36999 > URL: https://issues.apache.org/jira/browse/SPARK-36999 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Priority: Major > Labels: starter > > Update the page > [https://spark.apache.org/docs/3.1.2/sql-ref-syntax-ddl-alter-table.html,] > and document the command ALTER TABLE RECOVER PARTITIONS -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
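For reference while writing the docs, the command in question looks like this (the table name is illustrative):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Scans the table location and adds any partition directories found on
# storage (e.g. dt=2022-09-05/) that are missing from the metastore.
spark.sql("ALTER TABLE my_partitioned_table RECOVER PARTITIONS")
# Equivalent spelling in Spark SQL: MSCK REPAIR TABLE my_partitioned_table
{code}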
[jira] [Commented] (SPARK-40351) Spark Sum increases the precision of DecimalType arguments by 10
[ https://issues.apache.org/jira/browse/SPARK-40351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600535#comment-17600535 ] Yuming Wang commented on SPARK-40351: - https://github.com/apache/spark/blob/v3.3.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Sum.scala#L52-L53 Why do you want to override this value? > Spark Sum increases the precision of DecimalType arguments by 10 > > > Key: SPARK-40351 > URL: https://issues.apache.org/jira/browse/SPARK-40351 > Project: Spark > Issue Type: Question > Components: Optimizer >Affects Versions: 3.2.0 >Reporter: Tymofii >Priority: Minor > > Currently in Spark automatically increases Decimal field by 10 (hard coded > value) after SUM aggregate operation - > [https://github.com/apache/spark/blob/branch-3.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1877.] > There are a couple of questions: > # Why was 10 chosen as default one? > # Is it make sense to allow the user to override this value via > configuration? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
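The behaviour being asked about is easy to observe in the aggregated schema (a minimal sketch):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.range(3).select(F.col("id").cast("decimal(10,2)").alias("amount"))
df.printSchema()                       # amount: decimal(10,2)

# Sum widens the result by 10 digits of precision as overflow headroom:
df.agg(F.sum("amount")).printSchema()  # sum(amount): decimal(20,2)
{code}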
[jira] [Assigned] (SPARK-40326) upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 to 2.13.4
[ https://issues.apache.org/jira/browse/SPARK-40326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-40326: Assignee: Bjørn Jørgensen > upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 > to 2.13.4 > -- > > Key: SPARK-40326 > URL: https://issues.apache.org/jira/browse/SPARK-40326 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > > [CVE-2022-25857|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-25857] > [SNYK-JAVA-ORGYAML|https://security.snyk.io/vuln/SNYK-JAVA-ORGYAML-2806360] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40326) upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 to 2.13.4
[ https://issues.apache.org/jira/browse/SPARK-40326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-40326: - Priority: Minor (was: Major) > upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 > to 2.13.4 > -- > > Key: SPARK-40326 > URL: https://issues.apache.org/jira/browse/SPARK-40326 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Minor > Fix For: 3.4.0, 3.3.1 > > > [CVE-2022-25857|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-25857] > [SNYK-JAVA-ORGYAML|https://security.snyk.io/vuln/SNYK-JAVA-ORGYAML-2806360] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40326) upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 to 2.13.4
[ https://issues.apache.org/jira/browse/SPARK-40326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-40326. -- Fix Version/s: 3.3.1 3.4.0 Resolution: Fixed Issue resolved by pull request 37796 [https://github.com/apache/spark/pull/37796] > upgrade com.fasterxml.jackson.dataformat:jackson-dataformat-yaml from 2.13.3 > to 2.13.4 > -- > > Key: SPARK-40326 > URL: https://issues.apache.org/jira/browse/SPARK-40326 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.3.1, 3.4.0 > > > [CVE-2022-25857|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-25857] > [SNYK-JAVA-ORGYAML|https://security.snyk.io/vuln/SNYK-JAVA-ORGYAML-2806360] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39671) insert overwrite table java.lang.NoSuchMethodException: org.apache.hadoop.hive.ql.metadata.Hive.loadPartition .This problem occurred when we installed Apache Spark3.0.
[ https://issues.apache.org/jira/browse/SPARK-39671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600541#comment-17600541 ] Iqbal Singh commented on SPARK-39671: - Is there a way to reproduce it, or is this something specific to Cloudera distribution only. > insert overwrite table java.lang.NoSuchMethodException: > org.apache.hadoop.hive.ql.metadata.Hive.loadPartition .This problem occurred > when we installed Apache Spark3.0.1-hadoop3.0 in CDH6.1.1 > > > Key: SPARK-39671 > URL: https://issues.apache.org/jira/browse/SPARK-39671 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: xin >Priority: Major > > use spark-thrifter run this sql insert overwrite table xx.xx > partition(dt=2022-06-30) select * from xxx.xxx; The SQL execution > environment is cdh 6.1.1 hive version 2.1.1 > > > raise OperationalError(response) pyhive.exc.OperationalError: > TExecuteStatementResp(status=TStatus(statusCode=3, > infoMessages=['*org.apache.hive.service.cli.HiveSQLException:Error running > query: java.lang.NoSuchMethodException: > org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(org.apache.hadoop.fs.Path, > java.lang.String, java.util.Map, boolean, boolean, boolean, boolean, > boolean, boolean):25:24', > 'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute:SparkExecuteStatementOperation.scala:321', > > 'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:runInternal:SparkExecuteStatementOperation.scala:202', > 'org.apache.hive.service.cli.operation.Operation:run:Operation.java:278', > 'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:org$apache$spark$sql$hive$thriftserver$SparkOperation$$super$run:SparkExecuteStatementOperation.scala:46', > > 'org.apache.spark.sql.hive.thriftserver.SparkOperation:$anonfun$run$1:SparkOperation.scala:44', > 'scala.runtime.java8.JFunction0$mcV$sp:apply:JFunction0$mcV$sp.java:23', > 'org.apache.spark.sql.hive.thriftserver.SparkOperation:withLocalProperties:SparkOperation.scala:78', > > 'org.apache.spark.sql.hive.thriftserver.SparkOperation:withLocalProperties$:SparkOperation.scala:62', > > 'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:withLocalProperties:SparkExecuteStatementOperation.scala:46', > > 'org.apache.spark.sql.hive.thriftserver.SparkOperation:run:SparkOperation.scala:44', > > 'org.apache.spark.sql.hive.thriftserver.SparkOperation:run$:SparkOperation.scala:42', > > 'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:run:SparkExecuteStatementOperation.scala:46', > > 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:484', > > 'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:460', > > 'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:280', > > 'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:439', > > 'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1437', > > 'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1422', > 'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:38', > 'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', > 'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:53', > > 
'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:310', > > 'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149', > > 'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624', > 'java.lang.Thread:run:Thread.java:748', > '*java.lang.NoSuchMethodException:org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(org.apache.hadoop.fs.Path, > java.lang.String, java.util.Map, boolean, boolean, boolean, boolean, > boolean, boolean):63:38', 'java.lang.Class:getMethod:Class.java:1786', > 'org.apache.spark.sql.hive.client.Shim:findMethod:HiveShim.scala:177', > 'org.apache.spark.sql.hive.client.Shim_v2_1:loadPartitionMethod$lzycompute:HiveShim.scala:1151', > > 'org.apache.spark.sql.hive.client.Shim_v2_1:loadPartitionMethod:HiveShim.scala:1139', > > 'org.apache.spark.sql.hive.client.Shim_v2_1:loadPartition:HiveShim.scala:1201', > > 'org.apache.spark.sql.hive.client.HiveClientImpl:$anonfun$loadPartition$1:HiveClientImpl.scala:872', > 'scala.runtime.ja
[jira] [Resolved] (SPARK-40313) ps.DataFrame(data, index) should support the same anchor
[ https://issues.apache.org/jira/browse/SPARK-40313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40313. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37768 [https://github.com/apache/spark/pull/37768] > ps.DataFrame(data, index) should support the same anchor > > > Key: SPARK-40313 > URL: https://issues.apache.org/jira/browse/SPARK-40313 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40313) ps.DataFrame(data, index) should support the same anchor
[ https://issues.apache.org/jira/browse/SPARK-40313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40313: Assignee: Ruifeng Zheng > ps.DataFrame(data, index) should support the same anchor > > > Key: SPARK-40313 > URL: https://issues.apache.org/jira/browse/SPARK-40313 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39752) Spark job failed with 10M rows data with Broken pipe error
[ https://issues.apache.org/jira/browse/SPARK-39752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600543#comment-17600543 ] Iqbal Singh commented on SPARK-39752: - [~sshukla05] , Could you please provide the stack trace for the issue or a way to reproduce the error. > Spark job failed with 10M rows data with Broken pipe error > -- > > Key: SPARK-39752 > URL: https://issues.apache.org/jira/browse/SPARK-39752 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 3.0.3, 3.2.1 >Reporter: SHOBHIT SHUKLA >Priority: Major > Fix For: 3.0.2 > > > Spark job failed with 10M rows data with Broken pipe error. Same spark job > was working previously with the settings "executor_cores": 1, > "executor_memory": 1, "driver_cores": 1, "driver_memory": 1. where as the > same job is failing with spark settings in 3.0.3 and 3.2.1. > Major symptoms (slowness, timeout, out of memory as examples): Spark job is > failing with the error java.net.SocketException: Broken pipe (Write failed) > Here are the spark settings information which is working on Spark 3.0.3 and > 3.2.1 : "executor_cores": 4, "executor_memory": 4, "driver_cores": 4, > "driver_memory": 4.. The spark job doesn't consistently works with the above > settings. Some times, need to increase the cores and memory. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40351) Spark Sum increases the precision of DecimalType arguments by 10
[ https://issues.apache.org/jira/browse/SPARK-40351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tymofii updated SPARK-40351: Description: Currently in Spark automatically increases Decimal field by 10 (hard coded value) after SUM aggregate operation - [https://github.com/apache/spark/blob/branch-3.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1877.] There are a couple of questions: # Why was 10 chosen as default one? # Does it make sense to allow the user to override this value via configuration? was: Currently in Spark automatically increases Decimal field by 10 (hard coded value) after SUM aggregate operation - [https://github.com/apache/spark/blob/branch-3.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1877.] There are a couple of questions: # Why was 10 chosen as default one? # Is it make sense to allow the user to override this value via configuration? > Spark Sum increases the precision of DecimalType arguments by 10 > > > Key: SPARK-40351 > URL: https://issues.apache.org/jira/browse/SPARK-40351 > Project: Spark > Issue Type: Question > Components: Optimizer >Affects Versions: 3.2.0 >Reporter: Tymofii >Priority: Minor > > Currently in Spark automatically increases Decimal field by 10 (hard coded > value) after SUM aggregate operation - > [https://github.com/apache/spark/blob/branch-3.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1877.] > There are a couple of questions: > # Why was 10 chosen as default one? > # Does it make sense to allow the user to override this value via > configuration? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38404) Spark does not find CTE inside nested CTE
[ https://issues.apache.org/jira/browse/SPARK-38404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-38404: Fix Version/s: 3.3.1 > Spark does not find CTE inside nested CTE > - > > Key: SPARK-38404 > URL: https://issues.apache.org/jira/browse/SPARK-38404 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 > Environment: Tested on: > * MacOS Monterrey 12.2.1 (21D62) > * python 3.9.10 > * pip 22.0.3 > * pyspark 3.2.0 & 3.2.1 (SQL query does not work) and pyspark 3.0.1 and > 3.1.3 (SQL query works) >Reporter: Joan Heredia Rius >Assignee: Peter Toth >Priority: Minor > Fix For: 3.4.0, 3.3.1 > > > Hello! > Seems that when defining CTEs and using them inside another CTE in Spark SQL, > Spark thinks the inner call for the CTE is a table or view, which is not > found and then it errors with `Table or view not found: ` > h3. Steps to reproduce > # `pip install pyspark==3.2.0` (also happens with 3.2.1) > # start pyspark console by typing `pyspark` in the terminal > # Try to run the following SQL with `spark.sql(sql)` > > {code:java} > WITH mock_cte__users AS ( >SELECT 1 AS id >), >model_under_test AS ( > WITH users AS ( > SELECT * > FROM mock_cte__users > ) >SELECT * > FROM users >) > SELECT * > FROM model_under_test;{code} > Spark will fail with > > {code:java} > pyspark.sql.utils.AnalysisException: Table or view not found: > mock_cte__users; line 8 pos 29; {code} > I don't know if this is a regression or an expected behavior of the new 3.2.* > versions. This fix introduced in 3.2.0 might be related: > https://issues.apache.org/jira/browse/SPARK-36447 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40297) CTE outer reference nested in CTE main body cannot be resolved
[ https://issues.apache.org/jira/browse/SPARK-40297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40297. - Fix Version/s: 3.4.0 3.3.1 Assignee: Wei Xue Resolution: Fixed > CTE outer reference nested in CTE main body cannot be resolved > -- > > Key: SPARK-40297 > URL: https://issues.apache.org/jira/browse/SPARK-40297 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wei Xue >Assignee: Wei Xue >Priority: Minor > Fix For: 3.4.0, 3.3.1 > > > AnalysisException "Table or view not found" is thrown when a CTE reference > occurs in an inner CTE definition nested in the outer CTE's main body FROM > clause. E.g., > {code} > WITH cte_outer AS ( > SELECT 1 > ) > SELECT * FROM ( > WITH cte_inner AS ( > SELECT * FROM cte_outer > ) > SELECT * FROM cte_inner > ) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40351) Spark Sum increases the precision of DecimalType arguments by 10
[ https://issues.apache.org/jira/browse/SPARK-40351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600572#comment-17600572 ] Tymofii commented on SPARK-40351: - # Not sure I understood why you showed those lines of code. # For example, the issue we faced is as follows. Source tables have decimal fields with a certain precision defined during table creation. There are a number of queries which extract and transform the data from those source tables and load it into the target one, which also has a decimal field with the same precision as the source tables. So the users know for sure that summing values in the source decimal fields will not exceed the target table field precision. Currently they have to add an explicit cast after the SUM function to comply with the target table definition, since our ETL flow would fail otherwise. It is not very convenient if there are multiple queries. So they could, for example, disable the automatic increase of precision in this case. # Another question - what is the rationale behind the number 10? > Spark Sum increases the precision of DecimalType arguments by 10 > > > Key: SPARK-40351 > URL: https://issues.apache.org/jira/browse/SPARK-40351 > Project: Spark > Issue Type: Question > Components: Optimizer >Affects Versions: 3.2.0 >Reporter: Tymofii >Priority: Minor > > Currently in Spark automatically increases Decimal field by 10 (hard coded > value) after SUM aggregate operation - > [https://github.com/apache/spark/blob/branch-3.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1877.] > There are a couple of questions: > # Why was 10 chosen as default one? > # Does it make sense to allow the user to override this value via > configuration? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
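The explicit cast described above would look like this (table and column names are made up):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for a source table with a DECIMAL(10,2) column.
spark.range(3).selectExpr("CAST(id AS DECIMAL(10,2)) AS amount") \
    .createOrReplaceTempView("source_table")

# Restores the source precision after aggregation so the result matches a
# target column declared as DECIMAL(10,2); values that genuinely overflow
# become NULL (or raise an error under ANSI mode) instead of widening.
result = spark.sql(
    "SELECT CAST(SUM(amount) AS DECIMAL(10, 2)) AS amount_total FROM source_table"
)
result.printSchema()  # amount_total: decimal(10,2), back to the source precision
{code}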
[jira] [Resolved] (SPARK-39830) Reading ORC table that requires type promotion may throw AIOOBE
[ https://issues.apache.org/jira/browse/SPARK-39830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-39830. --- Fix Version/s: 3.4.0 Assignee: dzcxzl Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/37800 > Reading ORC table that requires type promotion may throw AIOOBE > --- > > Key: SPARK-39830 > URL: https://issues.apache.org/jira/browse/SPARK-39830 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.4.0 > > > We can add a UT to test the scenario after the ORC-1205 release. > > bin/spark-shell > {code:java} > spark.sql("set orc.stripe.size=10240") > spark.sql("set orc.rows.between.memory.checks=1") > spark.sql("set spark.sql.orc.columnarWriterBatchSize=1") > val df = spark.range(1, 1+512, 1, 1).map { i => > if( i == 1 ){ > (i, Array.fill[Byte](5 * 1024 * 1024)('X')) > } else { > (i,Array.fill[Byte](1)('X')) > } > }.toDF("c1","c2") > df.write.format("orc").save("file:///tmp/test_table_orc_t1") > spark.sql("create external table test_table_orc_t1 (c1 string ,c2 binary) > location 'file:///tmp/test_table_orc_t1' stored as orc ") > spark.sql("select * from test_table_orc_t1").show() {code} > Querying this table will get the following exception > {code:java} > java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387) > at > org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740) > at > org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069) > at > org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77) > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:84) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:102) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39830) Add a test case to read ORC table that requires type promotion
[ https://issues.apache.org/jira/browse/SPARK-39830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-39830: -- Summary: Add a test case to read ORC table that requires type promotion (was: Reading ORC table that requires type promotion may throw AIOOBE) > Add a test case to read ORC table that requires type promotion > -- > > Key: SPARK-39830 > URL: https://issues.apache.org/jira/browse/SPARK-39830 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.4.0 > > > We can add a UT to test the scenario after the ORC-1205 release. > > bin/spark-shell > {code:java} > spark.sql("set orc.stripe.size=10240") > spark.sql("set orc.rows.between.memory.checks=1") > spark.sql("set spark.sql.orc.columnarWriterBatchSize=1") > val df = spark.range(1, 1+512, 1, 1).map { i => > if( i == 1 ){ > (i, Array.fill[Byte](5 * 1024 * 1024)('X')) > } else { > (i,Array.fill[Byte](1)('X')) > } > }.toDF("c1","c2") > df.write.format("orc").save("file:///tmp/test_table_orc_t1") > spark.sql("create external table test_table_orc_t1 (c1 string ,c2 binary) > location 'file:///tmp/test_table_orc_t1' stored as orc ") > spark.sql("select * from test_table_orc_t1").show() {code} > Querying this table will get the following exception > {code:java} > java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387) > at > org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740) > at > org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069) > at > org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77) > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:84) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:102) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39830) Reading ORC table that requires type promotion may throw AIOOBE
[ https://issues.apache.org/jira/browse/SPARK-39830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-39830: -- Component/s: Tests > Reading ORC table that requires type promotion may throw AIOOBE > --- > > Key: SPARK-39830 > URL: https://issues.apache.org/jira/browse/SPARK-39830 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.4.0 > > > We can add a UT to test the scenario after the ORC-1205 release. > > bin/spark-shell > {code:java} > spark.sql("set orc.stripe.size=10240") > spark.sql("set orc.rows.between.memory.checks=1") > spark.sql("set spark.sql.orc.columnarWriterBatchSize=1") > val df = spark.range(1, 1+512, 1, 1).map { i => > if( i == 1 ){ > (i, Array.fill[Byte](5 * 1024 * 1024)('X')) > } else { > (i,Array.fill[Byte](1)('X')) > } > }.toDF("c1","c2") > df.write.format("orc").save("file:///tmp/test_table_orc_t1") > spark.sql("create external table test_table_orc_t1 (c1 string ,c2 binary) > location 'file:///tmp/test_table_orc_t1' stored as orc ") > spark.sql("select * from test_table_orc_t1").show() {code} > Querying this table will get the following exception > {code:java} > java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.orc.impl.TreeReaderFactory$TreeReader.nextVector(TreeReaderFactory.java:387) > at > org.apache.orc.impl.TreeReaderFactory$LongTreeReader.nextVector(TreeReaderFactory.java:740) > at > org.apache.orc.impl.ConvertTreeReaderFactory$StringGroupFromAnyIntegerTreeReader.nextVector(ConvertTreeReaderFactory.java:1069) > at > org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100) > at > org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77) > at > org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:84) > at > org.apache.orc.mapreduce.OrcMapreduceRecordReader.nextKeyValue(OrcMapreduceRecordReader.java:102) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org