[jira] [Commented] (SPARK-40081) Add Document Parameters for pyspark.sql.streaming.query
[ https://issues.apache.org/jira/browse/SPARK-40081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582149#comment-17582149 ] Qian Sun commented on SPARK-40081: -- [~hyukjin.kwon] Yes, I'm working on it > Add Document Parameters for pyspark.sql.streaming.query > --- > > Key: SPARK-40081 > URL: https://issues.apache.org/jira/browse/SPARK-40081 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Qian Sun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39310) rename `required_same_anchor`
[ https://issues.apache.org/jira/browse/SPARK-39310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39310. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37585 [https://github.com/apache/spark/pull/37585] > rename `required_same_anchor` > - > > Key: SPARK-39310 > URL: https://issues.apache.org/jira/browse/SPARK-39310 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0 > > > https://github.com/apache/spark/pull/36353#discussion_r882216133 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39310) rename `required_same_anchor`
[ https://issues.apache.org/jira/browse/SPARK-39310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39310: Assignee: Apache Spark > rename `required_same_anchor` > - > > Key: SPARK-39310 > URL: https://issues.apache.org/jira/browse/SPARK-39310 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Apache Spark >Priority: Major > > https://github.com/apache/spark/pull/36353#discussion_r882216133 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39310) rename `required_same_anchor`
[ https://issues.apache.org/jira/browse/SPARK-39310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39310: Assignee: Yikun Jiang (was: Apache Spark) > rename `required_same_anchor` > - > > Key: SPARK-39310 > URL: https://issues.apache.org/jira/browse/SPARK-39310 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > https://github.com/apache/spark/pull/36353#discussion_r882216133 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39150) Remove `# doctest: +SKIP` of SPARK-38947/SPARK-39326 when infra dump pandas to 1.4+
[ https://issues.apache.org/jira/browse/SPARK-39150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39150. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37584 [https://github.com/apache/spark/pull/37584] > Remove `# doctest: +SKIP` of SPARK-38947/SPARK-39326 when infra dump pandas > to 1.4+ > --- > > Key: SPARK-39150 > URL: https://issues.apache.org/jira/browse/SPARK-39150 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > [https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/python/pyspark/pandas/groupby.py#L2333] > [https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/python/pyspark/pandas/groupby.py#L2265] > all doctest in https://github.com/apache/spark/pull/36712 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39170) ImportError when creating pyspark.pandas document "Supported APIs" if pandas version is low.
[ https://issues.apache.org/jira/browse/SPARK-39170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39170: Assignee: Yikun Jiang > ImportError when creating pyspark.pandas document "Supported APIs" if pandas > version is low. > > > Key: SPARK-39170 > URL: https://issues.apache.org/jira/browse/SPARK-39170 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyunwoo Park >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > The pyspark.pandas documentation "Supported APIs" will be auto-generated. > ([SPARK-38961|https://issues.apache.org/jira/browse/SPARK-38961]) > At this point, we need to verify the version of pandas. It can be applied > after the docker image used in github action is upgraded and republished at > https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage. > Related: https://github.com/apache/spark/pull/36509 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39150) Remove `# doctest: +SKIP` of SPARK-38947/SPARK-39326 when infra dump pandas to 1.4+
[ https://issues.apache.org/jira/browse/SPARK-39150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39150: Assignee: Yikun Jiang > Remove `# doctest: +SKIP` of SPARK-38947/SPARK-39326 when infra dump pandas > to 1.4+ > --- > > Key: SPARK-39150 > URL: https://issues.apache.org/jira/browse/SPARK-39150 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > > [https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/python/pyspark/pandas/groupby.py#L2333] > [https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/python/pyspark/pandas/groupby.py#L2265] > all doctest in https://github.com/apache/spark/pull/36712 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-40142) Make pyspark.sql.functions examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-40142: -- > Make pyspark.sql.functions examples self-contained > -- > > Key: SPARK-40142 > URL: https://issues.apache.org/jira/browse/SPARK-40142 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40142) Make pyspark.sql.functions examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40142: - Fix Version/s: (was: 3.4.0) > Make pyspark.sql.functions examples self-contained > -- > > Key: SPARK-40142 > URL: https://issues.apache.org/jira/browse/SPARK-40142 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40142) Make pyspark.sql.functions examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40142. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37581 [https://github.com/apache/spark/pull/37581] > Make pyspark.sql.functions examples self-contained > -- > > Key: SPARK-40142 > URL: https://issues.apache.org/jira/browse/SPARK-40142 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40145) Create infra image when cut down branches
[ https://issues.apache.org/jira/browse/SPARK-40145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40145: Assignee: Yikun Jiang > Create infra image when cut down branches > - > > Key: SPARK-40145 > URL: https://issues.apache.org/jira/browse/SPARK-40145 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40145) Create infra image when cut down branches
[ https://issues.apache.org/jira/browse/SPARK-40145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40145. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37579 [https://github.com/apache/spark/pull/37579] > Create infra image when cut down branches > - > > Key: SPARK-40145 > URL: https://issues.apache.org/jira/browse/SPARK-40145 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39170) ImportError when creating pyspark.pandas document "Supported APIs" if pandas version is low.
[ https://issues.apache.org/jira/browse/SPARK-39170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39170. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37583 [https://github.com/apache/spark/pull/37583] > ImportError when creating pyspark.pandas document "Supported APIs" if pandas > version is low. > > > Key: SPARK-39170 > URL: https://issues.apache.org/jira/browse/SPARK-39170 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyunwoo Park >Priority: Major > Fix For: 3.4.0 > > > The pyspark.pandas documentation "Supported APIs" will be auto-generated. > ([SPARK-38961|https://issues.apache.org/jira/browse/SPARK-38961]) > At this point, we need to verify the version of pandas. It can be applied > after the docker image used in github action is upgraded and republished at > https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage. > Related: https://github.com/apache/spark/pull/36509 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
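The guard SPARK-39170 describes boils down to comparing the installed pandas version against a minimum before generating the "Supported APIs" page. A minimal sketch, illustrative only; the real code would likely use `packaging.version` or pandas' own version constant rather than this hand-rolled helper:

```python
def version_at_least(installed: str, minimum: str) -> bool:
    """Compare dotted version strings numerically, so that e.g. '1.10' > '1.4'
    (a plain string comparison would get this wrong)."""
    def parts(v: str):
        return tuple(int(p) for p in v.split(".") if p.isdigit())
    return parts(installed) >= parts(minimum)

# Only generate the supported-API pages when pandas is new enough.
print(version_at_least("1.3.5", "1.4"))   # False: skip generation
print(version_at_least("1.10.0", "1.4"))  # True
```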
[jira] [Comment Edited] (SPARK-38648) SPIP: Simplified API for DL Inferencing
[ https://issues.apache.org/jira/browse/SPARK-38648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582071#comment-17582071 ] Xiangrui Meng edited comment on SPARK-38648 at 8/19/22 10:55 PM: - I had an offline discussion with [~leewyang]. Summary: We might not need to introduce a new package in Spark with dependencies on DL frameworks. Instead, we can provide abstractions in pyspark.ml to implement the common data operations needed by DL inference, e.g., batching, tensor conversion, pipelining, etc. For example, we can define the following API (just to illustrate the idea, not proposing the final API): {code:scala} def dl_model_udf( predict_fn: Callable[pd.DataFrame, pd.DataFrame], # need to discuss the data format batch_size: int, input_tensor_shapes: Map[str, List[int]], output_data_type, preprocess_fn, ... ) -> PandasUDF {code} Users only need to supply predict_fn, which could return a (wrapped) TensorFlow model, a PyTorch model, or an MLflow model. Users are responsible for package dependency management and model loading logic. We don't cover everything proposed in the original SPIP, but we do save users the boilerplate of creating batches over Iterator[DataFrame], converting 1d arrays to tensors, and overlapping preprocessing (CPU) with prediction (GPU). If we go with this direction, I don't feel the change needs an SPIP because it doesn't introduce a new Spark package or new dependencies. It is just a wrapper over pandas_udf for DL inference. was (Author: mengxr): I had an offline discussion with [~leewyang]. Summary: We might not need to introduce a new package in Spark with dependencies on DL frameworks. Instead, we can provide abstractions in pyspark.ml to implement the common data operations needed by DL inference, e.g., batching, tensor conversion, pipelining, etc. 
For example, we can define the following API (just to illustrate the idea, not proposing the final API): {code:scala} def dl_model_udf( predict_fn: Callable[pd.DataFrame, pd.DataFrame], # need to discuss the data format batch_size: int, input_tensor_shapes: Map[str, List[int]], output_data_type, preprocess_fn, ... ) -> PandasUDF {code} Users only need to supply predict_fn, which could return a (wrapped) TensorFlow model, a PyTorch model, or an MLflow model. Users are responsible for package dependency management and model loading logic. We don't cover everything proposed in the original SPIP, but we do save users the boilerplate of creating batches over Iterator[DataFrame], converting 1d arrays to tensors, and overlapping preprocessing (CPU) with prediction (GPU). If we go with this direction, I don't feel the change needs an SPIP because it doesn't introduce a new Spark package or new dependencies. > SPIP: Simplified API for DL Inferencing > --- > > Key: SPARK-38648 > URL: https://issues.apache.org/jira/browse/SPARK-38648 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Lee Yang >Priority: Minor > > h1. Background and Motivation > The deployment of deep learning (DL) models to Spark clusters can be a point > of friction today. DL practitioners often aren't well-versed with Spark, and > Spark experts often aren't well-versed with the fast-changing DL frameworks. > Currently, the deployment of trained DL models is done in a fairly ad-hoc > manner, with each model integration usually requiring significant effort. > To simplify this process, we propose adding an integration layer for each > major DL framework that can introspect their respective saved models to > more-easily integrate these models into Spark applications. You can find a > detailed proposal here: > [https://docs.google.com/document/d/1n7QPHVZfmQknvebZEXxzndHPV2T71aBsDnP4COQa_v0] > h1. 
Goals > - Simplify the deployment of pre-trained single-node DL models to Spark > inference applications. > - Follow pandas_udf for simple inference use-cases. > - Follow Spark ML Pipelines APIs for transfer-learning use-cases. > - Enable integrations with popular third-party DL frameworks like > TensorFlow, PyTorch, and Huggingface. > - Focus on PySpark, since most of the DL frameworks use Python. > - Take advantage of built-in Spark features like GPU scheduling and Arrow > integration. > - Enable inference on both CPU and GPU. > h1. Non-goals > - DL model training. > - Inference w/ distributed models, i.e. "model parallel" inference. > h1. Target Personas > - Data scientists who need to deploy DL models on Spark. > - Developers who need to deploy DL models on Spark. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubsc
[jira] [Commented] (SPARK-38648) SPIP: Simplified API for DL Inferencing
[ https://issues.apache.org/jira/browse/SPARK-38648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582071#comment-17582071 ] Xiangrui Meng commented on SPARK-38648: --- I had an offline discussion with [~leewyang]. Summary: We might not need to introduce a new package in Spark with dependencies on DL frameworks. Instead, we can provide abstractions in pyspark.ml to implement the common data operations needed by DL inference, e.g., batching, tensor conversion, pipelining, etc. For example, we can define the following API (just to illustrate the idea, not proposing the final API): {code:scala} def dl_model_udf( predict_fn: Callable[pd.DataFrame, pd.DataFrame], # need to discuss the data format batch_size: int, input_tensor_shapes: Map[str, List[int]], output_data_type, preprocess_fn, ... ) -> PandasUDF {code} Users only need to supply predict_fn, which could return a (wrapped) TensorFlow model, a PyTorch model, or an MLflow model. Users are responsible for package dependency management and model loading logic. We don't cover everything proposed in the original SPIP, but we do save users the boilerplate of creating batches over Iterator[DataFrame], converting 1d arrays to tensors, and overlapping preprocessing (CPU) with prediction (GPU). If we go with this direction, I don't feel the change needs an SPIP because it doesn't introduce a new Spark package or new dependencies. > SPIP: Simplified API for DL Inferencing > --- > > Key: SPARK-38648 > URL: https://issues.apache.org/jira/browse/SPARK-38648 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: Lee Yang >Priority: Minor > > h1. Background and Motivation > The deployment of deep learning (DL) models to Spark clusters can be a point > of friction today. DL practitioners often aren't well-versed with Spark, and > Spark experts often aren't well-versed with the fast-changing DL frameworks. 
> Currently, the deployment of trained DL models is done in a fairly ad-hoc > manner, with each model integration usually requiring significant effort. > To simplify this process, we propose adding an integration layer for each > major DL framework that can introspect their respective saved models to > more-easily integrate these models into Spark applications. You can find a > detailed proposal here: > [https://docs.google.com/document/d/1n7QPHVZfmQknvebZEXxzndHPV2T71aBsDnP4COQa_v0] > h1. Goals > - Simplify the deployment of pre-trained single-node DL models to Spark > inference applications. > - Follow pandas_udf for simple inference use-cases. > - Follow Spark ML Pipelines APIs for transfer-learning use-cases. > - Enable integrations with popular third-party DL frameworks like > TensorFlow, PyTorch, and Huggingface. > - Focus on PySpark, since most of the DL frameworks use Python. > - Take advantage of built-in Spark features like GPU scheduling and Arrow > integration. > - Enable inference on both CPU and GPU. > h1. Non-goals > - DL model training. > - Inference w/ distributed models, i.e. "model parallel" inference. > h1. Target Personas > - Data scientists who need to deploy DL models on Spark. > - Developers who need to deploy DL models on Spark. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
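The boilerplate the comment above says the wrapper would absorb (batching an iterator of rows, and reshaping flat 1d arrays back into tensors before calling the model) can be sketched in plain Python. Everything below is illustrative: `dl_inference_udf`, its parameters, and the toy "model" are assumptions made for this sketch, not the API proposed in the comment, and real code would operate on pandas DataFrames and NumPy arrays rather than lists.

```python
from itertools import islice

def batched(rows, batch_size):
    """Group an iterator of rows into fixed-size batches (the last may be smaller)."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def reshape(flat, shape):
    """Reshape a flat list into nested lists matching the given tensor shape."""
    if len(shape) == 1:
        return list(flat)
    step = len(flat) // shape[0]
    return [reshape(flat[i * step:(i + 1) * step], shape[1:]) for i in range(shape[0])]

def dl_inference_udf(predict_fn, batch_size, input_tensor_shape):
    """Wrap predict_fn so it sees batches of reshaped tensors instead of flat rows."""
    def run(rows):
        out = []
        for batch in batched(rows, batch_size):
            tensors = [reshape(row, input_tensor_shape) for row in batch]
            out.extend(predict_fn(tensors))
        return out
    return run

# Toy "model": sums each 2x2 tensor. Three flat 4-element rows arrive in
# batches of two, get reshaped to 2x2, and are predicted batch-by-batch.
udf = dl_inference_udf(lambda ts: [sum(map(sum, t)) for t in ts],
                       batch_size=2, input_tensor_shape=[2, 2])
print(udf([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]))  # → [10, 26, 42]
```

The user supplies only the `predict_fn` lambda; the wrapper owns batching and tensor conversion, which is exactly the division of labor the comment argues removes the need for a new Spark package.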
[jira] [Assigned] (SPARK-40153) Unify the logic of resolve functions and table-valued functions
[ https://issues.apache.org/jira/browse/SPARK-40153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40153: Assignee: (was: Apache Spark) > Unify the logic of resolve functions and table-valued functions > --- > > Key: SPARK-40153 > URL: https://issues.apache.org/jira/browse/SPARK-40153 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Priority: Major > > Make ResolveTableValuedFunctions similar to ResolveFunctions: first try > resolving the function as a built-in or temp function, then expand the > identifier and resolve it as a persistent function. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40153) Unify the logic of resolve functions and table-valued functions
[ https://issues.apache.org/jira/browse/SPARK-40153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40153: Assignee: Apache Spark > Unify the logic of resolve functions and table-valued functions > --- > > Key: SPARK-40153 > URL: https://issues.apache.org/jira/browse/SPARK-40153 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Assignee: Apache Spark >Priority: Major > > Make ResolveTableValuedFunctions similar to ResolveFunctions: first try > resolving the function as a built-in or temp function, then expand the > identifier and resolve it as a persistent function. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40153) Unify the logic of resolve functions and table-valued functions
[ https://issues.apache.org/jira/browse/SPARK-40153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582053#comment-17582053 ] Apache Spark commented on SPARK-40153: -- User 'allisonwang-db' has created a pull request for this issue: https://github.com/apache/spark/pull/37586 > Unify the logic of resolve functions and table-valued functions > --- > > Key: SPARK-40153 > URL: https://issues.apache.org/jira/browse/SPARK-40153 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Allison Wang >Priority: Major > > Make ResolveTableValuedFunctions similar to ResolveFunctions: first try > resolving the function as a built-in or temp function, then expand the > identifier and resolve it as a persistent function. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40153) Unify the logic of resolve functions and table-valued functions
Allison Wang created SPARK-40153: Summary: Unify the logic of resolve functions and table-valued functions Key: SPARK-40153 URL: https://issues.apache.org/jira/browse/SPARK-40153 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Allison Wang Make ResolveTableValuedFunctions similar to ResolveFunctions: first try resolving the function as a built-in or temp function, then expand the identifier and resolve it as a persistent function. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
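The unified lookup order described in SPARK-40153 (try built-in or temporary functions first, then expand the bare name against the current namespace and resolve it as a persistent function) can be sketched abstractly. The function and structure names below are illustrative stand-ins, not Spark's analyzer API:

```python
def resolve_function(name, builtins, temp_funcs, catalog, current_namespace):
    """Two-phase lookup sketch: built-in/temp first, then a qualified
    catalog (persistent) lookup."""
    if name in builtins:
        return ("builtin", name)
    if name in temp_funcs:
        return ("temp", name)
    # Expand the identifier against the current catalog/namespace.
    qualified = ".".join([*current_namespace, name])
    if qualified in catalog:
        return ("persistent", qualified)
    raise LookupError(f"function not found: {name}")

catalog = {"spark_catalog.default.my_tvf": "..."}
print(resolve_function("my_tvf", {"range"}, {}, catalog,
                       ["spark_catalog", "default"]))
# → ('persistent', 'spark_catalog.default.my_tvf')
```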
[jira] [Commented] (SPARK-40152) Codegen compilation error when using split_part
[ https://issues.apache.org/jira/browse/SPARK-40152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582045#comment-17582045 ] Bruce Robbins commented on SPARK-40152: --- Seems to be a simple case of missing semicolons. I think it's a very simple fix. > Codegen compilation error when using split_part > --- > > Key: SPARK-40152 > URL: https://issues.apache.org/jira/browse/SPARK-40152 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Bruce Robbins >Priority: Major > > The following query throws an error: > {noformat} > create or replace temp view v1 as > select * from values > ('11.12.13', '.', 3) > as v1(col1, col2, col3); > cache table v1; > SELECT split_part(col1, col2, col3) > from v1; > {noformat} > The error is: > {noformat} > 22/08/19 14:25:14 ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 42, Column 1: Expression "project_isNull_0 = false" is not a type > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 42, Column 1: Expression "project_isNull_0 = false" is not a type > at > org.codehaus.janino.Java$Atom.toTypeOrCompileException(Java.java:3934) > at org.codehaus.janino.Parser.parseBlockStatement(Parser.java:1887) > at org.codehaus.janino.Parser.parseBlockStatements(Parser.java:1811) > at org.codehaus.janino.Parser.parseBlock(Parser.java:1792) > at > {noformat} > In the end, {{split_part}} does successfully execute, although in interpreted > mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40152) Codegen compilation error when using split_part
Bruce Robbins created SPARK-40152: - Summary: Codegen compilation error when using split_part Key: SPARK-40152 URL: https://issues.apache.org/jira/browse/SPARK-40152 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Bruce Robbins The following query throws an error: {noformat} create or replace temp view v1 as select * from values ('11.12.13', '.', 3) as v1(col1, col2, col3); cache table v1; SELECT split_part(col1, col2, col3) from v1; {noformat} The error is: {noformat} 22/08/19 14:25:14 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 42, Column 1: Expression "project_isNull_0 = false" is not a type org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 42, Column 1: Expression "project_isNull_0 = false" is not a type at org.codehaus.janino.Java$Atom.toTypeOrCompileException(Java.java:3934) at org.codehaus.janino.Parser.parseBlockStatement(Parser.java:1887) at org.codehaus.janino.Parser.parseBlockStatements(Parser.java:1811) at org.codehaus.janino.Parser.parseBlock(Parser.java:1792) at {noformat} In the end, {{split_part}} does successfully execute, although in interpreted mode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
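For readers unfamiliar with the function, split_part's intended result (independent of the codegen bug) can be mimicked in plain Python. This is a sketch of the documented SQL semantics as I understand them (1-based parts, negative parts counting from the end, empty string when out of range), not Spark's implementation:

```python
def split_part(s: str, delimiter: str, part: int) -> str:
    """Return the part-th field of s split by delimiter (1-based). Out-of-range
    parts yield an empty string; negative parts count from the end."""
    if part == 0:
        raise ValueError("part must not be 0")
    fields = s.split(delimiter)
    if abs(part) > len(fields):
        return ""
    return fields[part - 1 if part > 0 else part]

print(split_part("11.12.13", ".", 3))  # the reported query → '13'
```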
[jira] [Assigned] (SPARK-40065) Executor ConfigMap is not mounted if profile is not default
[ https://issues.apache.org/jira/browse/SPARK-40065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-40065: - Assignee: Nobuaki Sukegawa > Executor ConfigMap is not mounted if profile is not default > --- > > Key: SPARK-40065 > URL: https://issues.apache.org/jira/browse/SPARK-40065 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Nobuaki Sukegawa >Assignee: Nobuaki Sukegawa >Priority: Minor > Fix For: 3.3.1, 3.2.3 > > > When the executor ConfigMap was made optional in SPARK-34316, the volume > mount was erroneously disabled whenever a non-default profile is used. > When spark.kubernetes.executor.disableConfigMap is false, the expected > behavior is that the ConfigMap is mounted regardless of the executor's > resource profile. > However, it is not mounted if the resource profile is non-default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40065) Executor ConfigMap is not mounted if profile is not default
[ https://issues.apache.org/jira/browse/SPARK-40065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-40065. --- Fix Version/s: 3.3.1 3.2.3 Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/37504 > Executor ConfigMap is not mounted if profile is not default > --- > > Key: SPARK-40065 > URL: https://issues.apache.org/jira/browse/SPARK-40065 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Nobuaki Sukegawa >Priority: Minor > Fix For: 3.3.1, 3.2.3 > > > When the executor ConfigMap was made optional in SPARK-34316, the volume > mount was erroneously disabled whenever a non-default profile is used. > When spark.kubernetes.executor.disableConfigMap is false, the expected > behavior is that the ConfigMap is mounted regardless of the executor's > resource profile. > However, it is not mounted if the resource profile is non-default. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
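The SPARK-40065 fix amounts to a change in the mount predicate, which can be sketched as follows. The function and constant names are illustrative stand-ins for the logic in the executor feature step, not the actual Spark code:

```python
DEFAULT_PROFILE_ID = 0

def should_mount_before_fix(disable_config_map: bool, profile_id: int) -> bool:
    # Buggy behavior: the mount was additionally (and wrongly) tied to the
    # default resource profile, so non-default profiles never got the ConfigMap.
    return not disable_config_map and profile_id == DEFAULT_PROFILE_ID

def should_mount_after_fix(disable_config_map: bool, profile_id: int) -> bool:
    # Expected behavior: only spark.kubernetes.executor.disableConfigMap
    # controls the mount, regardless of the resource profile.
    return not disable_config_map

# Non-default profile with the flag unset: the bug skipped the mount.
print(should_mount_before_fix(False, 3), should_mount_after_fix(False, 3))  # False True
```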
[jira] [Assigned] (SPARK-40060) Add numberDecommissioningExecutors metric
[ https://issues.apache.org/jira/browse/SPARK-40060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-40060: - Assignee: Zhongwei Zhu > Add numberDecommissioningExecutors metric > - > > Key: SPARK-40060 > URL: https://issues.apache.org/jira/browse/SPARK-40060 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > > The num of decommissioning executor should exposed as metric -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40060) Add numberDecommissioningExecutors metric
[ https://issues.apache.org/jira/browse/SPARK-40060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-40060. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37499 [https://github.com/apache/spark/pull/37499] > Add numberDecommissioningExecutors metric > - > > Key: SPARK-40060 > URL: https://issues.apache.org/jira/browse/SPARK-40060 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > Fix For: 3.4.0 > > > The num of decommissioning executor should exposed as metric -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40151) Fix return type for new median(interval) function
Serge Rielau created SPARK-40151: Summary: Fix return type for new median(interval) function Key: SPARK-40151 URL: https://issues.apache.org/jira/browse/SPARK-40151 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.4.0 Reporter: Serge Rielau median() currently returns an interval of the same type as the input. We should instead match mean() and avg(), where the result type is computed from the argument type as follows: - year-month interval: the result is an `INTERVAL YEAR TO MONTH`. - day-time interval: the result is an `INTERVAL DAY TO SECOND`. - in all other cases, the result is a DOUBLE. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
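The widening argument behind SPARK-40151 is easy to see in the numeric case: with an even number of rows the median is interpolated between two values, so the result type must be finer-grained than the input type (hence the widest interval fields for interval inputs, and DOUBLE otherwise). A small illustration with Python's stdlib:

```python
from statistics import median

# An even row count interpolates between the two middle values, so an
# integer-typed (or same-grained interval) result would lose precision.
print(median([1, 2]))     # 1.5, not representable as an integer
print(median([1, 2, 3]))  # 2
```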
[jira] [Assigned] (SPARK-38582) Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for `KubernetesUtils` to eliminate duplicate code pattern
[ https://issues.apache.org/jira/browse/SPARK-38582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-38582: - Assignee: Qian Sun > Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for > `KubernetesUtils` to eliminate duplicate code pattern > --- > > Key: SPARK-38582 > URL: https://issues.apache.org/jira/browse/SPARK-38582 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: Qian Sun >Assignee: Qian Sun >Priority: Minor > > There are many duplicate code patterns in the Spark codebase: > {code:java} > new EnvVarBuilder() > .withName(key) > .withValue(value) > .build() {code} > {code:java} > new EnvVarBuilder() > .withName(name) > .withValueFrom(new EnvVarSourceBuilder() > .withNewFieldRef(version, field) > .build()) > .build() > {code} > > [The assignment statement for executor envVar | > https://github.com/apache/spark/blob/branch-3.3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L123-L185] > spans 63 lines. We could introduce _buildEnvVarsWithKV_ and > _buildEnvVarsWithFieldRef_ functions to simplify the above code patterns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38582) Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for `KubernetesUtils` to eliminate duplicate code pattern
[ https://issues.apache.org/jira/browse/SPARK-38582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-38582. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 35886 [https://github.com/apache/spark/pull/35886] > Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for > `KubernetesUtils` to eliminate duplicate code pattern > --- > > Key: SPARK-38582 > URL: https://issues.apache.org/jira/browse/SPARK-38582 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: Qian Sun >Assignee: Qian Sun >Priority: Minor > Fix For: 3.4.0 > > > There are many duplicate code patterns in Spark Code: > {code:java} > new EnvVarBuilder() > .withName(key) > .withValue(value) > .build() {code} > {code:java} > new EnvVarBuilder() >.withName(name) > .withValueFrom(new EnvVarSourceBuilder() >.withNewFieldRef(version, field) >.build()) >.build() > {code} > > [The assignment statement for executor envVar | > https://github.com/apache/spark/blob/branch-3.3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L123-L185] > has 63 lines. We could introduce _buildEnvVarsWithKV_ and > _buildEnvVarsWithFieldRef_ function to simplify the above code patterns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38582) Add KubernetesUtils.buildEnvVars(WithFieldRef)? utility functions
[ https://issues.apache.org/jira/browse/SPARK-38582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38582: -- Affects Version/s: 3.4.0 (was: 3.2.1) > Add KubernetesUtils.buildEnvVars(WithFieldRef)? utility functions > - > > Key: SPARK-38582 > URL: https://issues.apache.org/jira/browse/SPARK-38582 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Qian Sun >Assignee: Qian Sun >Priority: Minor > Fix For: 3.4.0 > > > There are many duplicate code patterns in Spark Code: > {code:java} > new EnvVarBuilder() > .withName(key) > .withValue(value) > .build() {code} > {code:java} > new EnvVarBuilder() >.withName(name) > .withValueFrom(new EnvVarSourceBuilder() >.withNewFieldRef(version, field) >.build()) >.build() > {code} > > [The assignment statement for executor envVar | > https://github.com/apache/spark/blob/branch-3.3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L123-L185] > has 63 lines. We could introduce _buildEnvVarsWithKV_ and > _buildEnvVarsWithFieldRef_ function to simplify the above code patterns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38582) Add KubernetesUtils.buildEnvVars(WithFieldRef)? utility functions
[ https://issues.apache.org/jira/browse/SPARK-38582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38582: -- Summary: Add KubernetesUtils.buildEnvVars(WithFieldRef)? utility functions (was: Introduce `buildEnvVarsWithKV` and `buildEnvVarsWithFieldRef` for `KubernetesUtils` to eliminate duplicate code pattern) > Add KubernetesUtils.buildEnvVars(WithFieldRef)? utility functions > - > > Key: SPARK-38582 > URL: https://issues.apache.org/jira/browse/SPARK-38582 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: Qian Sun >Assignee: Qian Sun >Priority: Minor > Fix For: 3.4.0 > > > There are many duplicate code patterns in Spark Code: > {code:java} > new EnvVarBuilder() > .withName(key) > .withValue(value) > .build() {code} > {code:java} > new EnvVarBuilder() >.withName(name) > .withValueFrom(new EnvVarSourceBuilder() >.withNewFieldRef(version, field) >.build()) >.build() > {code} > > [The assignment statement for executor envVar | > https://github.com/apache/spark/blob/branch-3.3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L123-L185] > has 63 lines. We could introduce _buildEnvVarsWithKV_ and > _buildEnvVarsWithFieldRef_ function to simplify the above code patterns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
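[Editor's note] The deduplication proposed in SPARK-38582 can be illustrated with a language-neutral sketch. The Python below models fabric8's EnvVar objects as plain dicts; the helper names mirror the proposed buildEnvVarsWithKV / buildEnvVarsWithFieldRef but this is only a sketch of the idea, not the actual Scala code in KubernetesUtils.

```python
def build_env_vars(kv_pairs):
    """Collapse the repeated name/value builder pattern into one call.
    Each (key, value) pair becomes one env-var entry."""
    return [{"name": k, "value": v} for k, v in kv_pairs]


def build_env_var_with_field_ref(name, api_version, field_path):
    """Counterpart for env vars sourced from a pod fieldRef, replacing the
    nested EnvVarSourceBuilder pattern shown in the ticket."""
    return {
        "name": name,
        "valueFrom": {
            "fieldRef": {"apiVersion": api_version, "fieldPath": field_path}
        },
    }
```

With helpers like these, the ~63-line block of builder calls in BasicExecutorFeatureStep collapses to one call per group of env vars.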
[jira] [Resolved] (SPARK-40000) Add config to toggle whether to automatically add default values for INSERTs without user-specified fields
[ https://issues.apache.org/jira/browse/SPARK-40000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel resolved SPARK-40000. Fix Version/s: 3.4.0 Resolution: Won't Fix Upon further analysis, we decided not to move forward with this change as it added too much complexity to downstream data sources. > Add config to toggle whether to automatically add default values for INSERTs > without user-specified fields > -- > > Key: SPARK-40000 > URL: https://issues.apache.org/jira/browse/SPARK-40000 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39310) rename `required_same_anchor`
[ https://issues.apache.org/jira/browse/SPARK-39310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581926#comment-17581926 ] Apache Spark commented on SPARK-39310: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37585 > rename `required_same_anchor` > - > > Key: SPARK-39310 > URL: https://issues.apache.org/jira/browse/SPARK-39310 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > https://github.com/apache/spark/pull/36353#discussion_r882216133 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39310) rename `required_same_anchor`
[ https://issues.apache.org/jira/browse/SPARK-39310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39310: Assignee: (was: Apache Spark) > rename `required_same_anchor` > - > > Key: SPARK-39310 > URL: https://issues.apache.org/jira/browse/SPARK-39310 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > https://github.com/apache/spark/pull/36353#discussion_r882216133 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39310) rename `required_same_anchor`
[ https://issues.apache.org/jira/browse/SPARK-39310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39310: Assignee: Apache Spark > rename `required_same_anchor` > - > > Key: SPARK-39310 > URL: https://issues.apache.org/jira/browse/SPARK-39310 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Apache Spark >Priority: Major > > https://github.com/apache/spark/pull/36353#discussion_r882216133 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39310) rename `required_same_anchor`
[ https://issues.apache.org/jira/browse/SPARK-39310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581924#comment-17581924 ] Apache Spark commented on SPARK-39310: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37585 > rename `required_same_anchor` > - > > Key: SPARK-39310 > URL: https://issues.apache.org/jira/browse/SPARK-39310 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > https://github.com/apache/spark/pull/36353#discussion_r882216133 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39150) Remove `# doctest: +SKIP` of SPARK-38947/SPARK-39326 when infra dump pandas to 1.4+
[ https://issues.apache.org/jira/browse/SPARK-39150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581919#comment-17581919 ] Apache Spark commented on SPARK-39150: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37584 > Remove `# doctest: +SKIP` of SPARK-38947/SPARK-39326 when infra dump pandas > to 1.4+ > --- > > Key: SPARK-39150 > URL: https://issues.apache.org/jira/browse/SPARK-39150 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > [https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/python/pyspark/pandas/groupby.py#L2333] > [https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/python/pyspark/pandas/groupby.py#L2265] > all doctest in https://github.com/apache/spark/pull/36712 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39150) Remove `# doctest: +SKIP` of SPARK-38947/SPARK-39326 when infra dump pandas to 1.4+
[ https://issues.apache.org/jira/browse/SPARK-39150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581918#comment-17581918 ] Apache Spark commented on SPARK-39150: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37584 > Remove `# doctest: +SKIP` of SPARK-38947/SPARK-39326 when infra dump pandas > to 1.4+ > --- > > Key: SPARK-39150 > URL: https://issues.apache.org/jira/browse/SPARK-39150 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > [https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/python/pyspark/pandas/groupby.py#L2333] > [https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/python/pyspark/pandas/groupby.py#L2265] > all doctest in https://github.com/apache/spark/pull/36712 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39150) Remove `# doctest: +SKIP` of SPARK-38947/SPARK-39326 when infra dump pandas to 1.4+
[ https://issues.apache.org/jira/browse/SPARK-39150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39150: Assignee: Apache Spark > Remove `# doctest: +SKIP` of SPARK-38947/SPARK-39326 when infra dump pandas > to 1.4+ > --- > > Key: SPARK-39150 > URL: https://issues.apache.org/jira/browse/SPARK-39150 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Apache Spark >Priority: Major > > [https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/python/pyspark/pandas/groupby.py#L2333] > [https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/python/pyspark/pandas/groupby.py#L2265] > all doctest in https://github.com/apache/spark/pull/36712 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39150) Remove `# doctest: +SKIP` of SPARK-38947/SPARK-39326 when infra dump pandas to 1.4+
[ https://issues.apache.org/jira/browse/SPARK-39150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39150: Assignee: (was: Apache Spark) > Remove `# doctest: +SKIP` of SPARK-38947/SPARK-39326 when infra dump pandas > to 1.4+ > --- > > Key: SPARK-39150 > URL: https://issues.apache.org/jira/browse/SPARK-39150 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > > [https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/python/pyspark/pandas/groupby.py#L2333] > [https://github.com/apache/spark/blob/fe85d7912f86c3e337aa93b23bfa7e7e01c0a32e/python/pyspark/pandas/groupby.py#L2265] > all doctest in https://github.com/apache/spark/pull/36712 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40018) Output SparkThrowable to SQL golden files in JSON format
[ https://issues.apache.org/jira/browse/SPARK-40018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40018. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37452 [https://github.com/apache/spark/pull/37452] > Output SparkThrowable to SQL golden files in JSON format > > > Key: SPARK-40018 > URL: https://issues.apache.org/jira/browse/SPARK-40018 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > Output content of SparkThrowable in the JSON format instead of plain text. > For instance, replace: > {code} > [INVALID_ARRAY_INDEX_IN_ELEMENT_AT] The index 5 is out of bounds. The array > has 3 elements. Use `try_element_at` to tolerate accessing element at invalid > index and return NULL instead. If necessary set "spark.sql.ansi.enabled" to > "false" to bypass this error. > == SQL(line 1, position 8) == > select element_at(array(1, 2, 3), 5) >^ > {code} > by > {code} > {"errorClass":"INVALID_ARRAY_INDEX_IN_ELEMENT_AT","messageParameters":["5","3","\"spark.sql.ansi.enabled\""],"queryContext":[{"objectType":"","objectName":"","startIndex":7,"stopIndex":35,"fragment":"element_at(array(1, > 2, 3), 5"}]} > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39170) ImportError when creating pyspark.pandas document "Supported APIs" if pandas version is low.
[ https://issues.apache.org/jira/browse/SPARK-39170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581905#comment-17581905 ] Apache Spark commented on SPARK-39170: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37583 > ImportError when creating pyspark.pandas document "Supported APIs" if pandas > version is low. > > > Key: SPARK-39170 > URL: https://issues.apache.org/jira/browse/SPARK-39170 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyunwoo Park >Priority: Major > > The pyspark.pandas documentation "Supported APIs" will be auto-generated. > ([SPARK-38961|https://issues.apache.org/jira/browse/SPARK-38961]) > At this point, we need to verify the version of pandas. It can be applied > after the docker image used in github action is upgraded and republished at > https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage. > Related: https://github.com/apache/spark/pull/36509 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39170) ImportError when creating pyspark.pandas document "Supported APIs" if pandas version is low.
[ https://issues.apache.org/jira/browse/SPARK-39170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39170: Assignee: Apache Spark > ImportError when creating pyspark.pandas document "Supported APIs" if pandas > version is low. > > > Key: SPARK-39170 > URL: https://issues.apache.org/jira/browse/SPARK-39170 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyunwoo Park >Assignee: Apache Spark >Priority: Major > > The pyspark.pandas documentation "Supported APIs" will be auto-generated. > ([SPARK-38961|https://issues.apache.org/jira/browse/SPARK-38961]) > At this point, we need to verify the version of pandas. It can be applied > after the docker image used in github action is upgraded and republished at > https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage. > Related: https://github.com/apache/spark/pull/36509 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39170) ImportError when creating pyspark.pandas document "Supported APIs" if pandas version is low.
[ https://issues.apache.org/jira/browse/SPARK-39170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581903#comment-17581903 ] Apache Spark commented on SPARK-39170: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37583 > ImportError when creating pyspark.pandas document "Supported APIs" if pandas > version is low. > > > Key: SPARK-39170 > URL: https://issues.apache.org/jira/browse/SPARK-39170 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyunwoo Park >Priority: Major > > The pyspark.pandas documentation "Supported APIs" will be auto-generated. > ([SPARK-38961|https://issues.apache.org/jira/browse/SPARK-38961]) > At this point, we need to verify the version of pandas. It can be applied > after the docker image used in github action is upgraded and republished at > https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage. > Related: https://github.com/apache/spark/pull/36509 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39170) ImportError when creating pyspark.pandas document "Supported APIs" if pandas version is low.
[ https://issues.apache.org/jira/browse/SPARK-39170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39170: Assignee: (was: Apache Spark) > ImportError when creating pyspark.pandas document "Supported APIs" if pandas > version is low. > > > Key: SPARK-39170 > URL: https://issues.apache.org/jira/browse/SPARK-39170 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyunwoo Park >Priority: Major > > The pyspark.pandas documentation "Supported APIs" will be auto-generated. > ([SPARK-38961|https://issues.apache.org/jira/browse/SPARK-38961]) > At this point, we need to verify the version of pandas. It can be applied > after the docker image used in github action is upgraded and republished at > https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage. > Related: https://github.com/apache/spark/pull/36509 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38961) Enhance to automatically generate the pandas API support list
[ https://issues.apache.org/jira/browse/SPARK-38961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581871#comment-17581871 ] Apache Spark commented on SPARK-38961: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37583 > Enhance to automatically generate the pandas API support list > - > > Key: SPARK-38961 > URL: https://issues.apache.org/jira/browse/SPARK-38961 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Hyunwoo Park >Priority: Major > Fix For: 3.4.0 > > > Currently, the supported pandas API list is manually maintained, so it would > be better to make the list automatically generated to reduce the maintenance > cost. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38961) Enhance to automatically generate the pandas API support list
[ https://issues.apache.org/jira/browse/SPARK-38961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581869#comment-17581869 ] Apache Spark commented on SPARK-38961: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37583 > Enhance to automatically generate the pandas API support list > - > > Key: SPARK-38961 > URL: https://issues.apache.org/jira/browse/SPARK-38961 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Hyunwoo Park >Priority: Major > Fix For: 3.4.0 > > > Currently, the supported pandas API list is manually maintained, so it would > be better to make the list automatically generated to reduce the maintenance > cost. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40050) Enhance EliminateSorts to support removing sorts via LocalLimit
[ https://issues.apache.org/jira/browse/SPARK-40050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40050: Summary: Enhance EliminateSorts to support removing sorts via LocalLimit (was: Eliminate the Sort if there is a LocalLimit between Join and Sort) > Enhance EliminateSorts to support removing sorts via LocalLimit > --- > > Key: SPARK-40050 > URL: https://issues.apache.org/jira/browse/SPARK-40050 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.4.0 > > > It seems we can remove Sort operator: > {code:scala} > val projectPlan = testRelation.select($"a", $"b") > val unnecessaryOrderByPlan = projectPlan.orderBy($"a".asc) > val localLimitPlan = LocalLimit(Literal(2), unnecessaryOrderByPlan) > val projectPlanB = testRelationB.select($"d") > val joinPlan = localLimitPlan.join(projectPlanB, RightOuter).select($"a", > $"d") > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40050) Eliminate the Sort if there is a LocalLimit between Join and Sort
[ https://issues.apache.org/jira/browse/SPARK-40050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-40050: --- Assignee: Yuming Wang > Eliminate the Sort if there is a LocalLimit between Join and Sort > - > > Key: SPARK-40050 > URL: https://issues.apache.org/jira/browse/SPARK-40050 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > It seems we can remove Sort operator: > {code:scala} > val projectPlan = testRelation.select($"a", $"b") > val unnecessaryOrderByPlan = projectPlan.orderBy($"a".asc) > val localLimitPlan = LocalLimit(Literal(2), unnecessaryOrderByPlan) > val projectPlanB = testRelationB.select($"d") > val joinPlan = localLimitPlan.join(projectPlanB, RightOuter).select($"a", > $"d") > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40050) Eliminate the Sort if there is a LocalLimit between Join and Sort
[ https://issues.apache.org/jira/browse/SPARK-40050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-40050. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37519 [https://github.com/apache/spark/pull/37519] > Eliminate the Sort if there is a LocalLimit between Join and Sort > - > > Key: SPARK-40050 > URL: https://issues.apache.org/jira/browse/SPARK-40050 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.4.0 > > > It seems we can remove Sort operator: > {code:scala} > val projectPlan = testRelation.select($"a", $"b") > val unnecessaryOrderByPlan = projectPlan.orderBy($"a".asc) > val localLimitPlan = LocalLimit(Literal(2), unnecessaryOrderByPlan) > val projectPlanB = testRelationB.select($"d") > val joinPlan = localLimitPlan.join(projectPlanB, RightOuter).select($"a", > $"d") > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40133) Regenerate excludedTpcdsQueries's golden files if regenerateGoldenFiles is true
[ https://issues.apache.org/jira/browse/SPARK-40133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-40133. - Fix Version/s: 3.4.0 Assignee: Yuming Wang Resolution: Fixed Resolved by https://github.com/apache/spark/pull/37562 > Regenerate excludedTpcdsQueries's golden files if regenerateGoldenFiles is > true > --- > > Key: SPARK-40133 > URL: https://issues.apache.org/jira/browse/SPARK-40133 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40147) Make pyspark.sql.session examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581824#comment-17581824 ] Apache Spark commented on SPARK-40147: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37582 > Make pyspark.sql.session examples self-contained > > > Key: SPARK-40147 > URL: https://issues.apache.org/jira/browse/SPARK-40147 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40147) Make pyspark.sql.session examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40147: Assignee: (was: Apache Spark) > Make pyspark.sql.session examples self-contained > > > Key: SPARK-40147 > URL: https://issues.apache.org/jira/browse/SPARK-40147 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40147) Make pyspark.sql.session examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581823#comment-17581823 ] Apache Spark commented on SPARK-40147: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37582 > Make pyspark.sql.session examples self-contained > > > Key: SPARK-40147 > URL: https://issues.apache.org/jira/browse/SPARK-40147 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40147) Make pyspark.sql.session examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40147: Assignee: Apache Spark > Make pyspark.sql.session examples self-contained > > > Key: SPARK-40147 > URL: https://issues.apache.org/jira/browse/SPARK-40147 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40142) Make pyspark.sql.functions examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581821#comment-17581821 ] Apache Spark commented on SPARK-40142: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37581 > Make pyspark.sql.functions examples self-contained > -- > > Key: SPARK-40142 > URL: https://issues.apache.org/jira/browse/SPARK-40142 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40146) Simply the codegen of getting map value
[ https://issues.apache.org/jira/browse/SPARK-40146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40146. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37580 [https://github.com/apache/spark/pull/37580] > Simply the codegen of getting map value > --- > > Key: SPARK-40146 > URL: https://issues.apache.org/jira/browse/SPARK-40146 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40150) Dynamically merge File Splits
Jackey Lee created SPARK-40150: -- Summary: Dynamically merge File Splits Key: SPARK-40150 URL: https://issues.apache.org/jira/browse/SPARK-40150 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Jackey Lee We currently use maxPartitionBytes and minPartitionNum to split files, and openCostInBytes to merge file splits. But these are static configurations, and no single configuration works in all scenarios. This PR attempts to merge file splits dynamically, taking concurrency into account while processing more data in one task. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
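For context, the static knobs the issue refers to can be sketched as a `spark-defaults.conf` fragment. The first two values below are the shipped defaults; the `minPartitionNum` value is illustrative (it is unset by default and falls back to the default parallelism):

```
# Static file-split tuning (SPARK-40150 proposes making the merge dynamic).
spark.sql.files.maxPartitionBytes   128m
spark.sql.files.openCostInBytes     4m
spark.sql.files.minPartitionNum     200
```

Because these are fixed per session, a setting tuned for a small scan can under-parallelize a large one, which is the motivation stated above.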
[jira] [Created] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key
Otakar Truněček created SPARK-40149: --- Summary: Star expansion after outer join asymmetrically includes joining key Key: SPARK-40149 URL: https://issues.apache.org/jira/browse/SPARK-40149 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.2, 3.3.0, 3.2.1, 3.2.0 Reporter: Otakar Truněček

When star expansion is used on the left side of a join, the result includes the joining key, while on the right side of the join it doesn't. I would expect the behaviour to be symmetric (either include the key on both sides or on neither). Example:

{code:python}
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()
df_left = spark.range(5).withColumn('val', f.lit('left'))
df_right = spark.range(3, 7).withColumn('val', f.lit('right'))
df_merged = (
    df_left
    .alias('left')
    .join(df_right.alias('right'), on='id', how='full_outer')
    .withColumn('left_all', f.struct('left.*'))
    .withColumn('right_all', f.struct('right.*'))
)
df_merged.show()
{code}

result:

{code:java}
+---+----+-----+------------+---------+
| id| val|  val|    left_all|right_all|
+---+----+-----+------------+---------+
|  0|left| null|   {0, left}|   {null}|
|  1|left| null|   {1, left}|   {null}|
|  2|left| null|   {2, left}|   {null}|
|  3|left|right|   {3, left}|  {right}|
|  4|left|right|   {4, left}|  {right}|
|  5|null|right|{null, null}|  {right}|
|  6|null|right|{null, null}|  {right}|
+---+----+-----+------------+---------+
{code}

This behaviour started with release 3.2.0. Previously the key was not included on either side.
Result from Spark 3.1.3:

{code:java}
+---+----+-----+--------+---------+
| id| val|  val|left_all|right_all|
+---+----+-----+--------+---------+
|  0|left| null|  {left}|   {null}|
|  6|null|right|  {null}|  {right}|
|  5|null|right|  {null}|  {right}|
|  1|left| null|  {left}|   {null}|
|  3|left|right|  {left}|  {right}|
|  2|left| null|  {left}|   {null}|
|  4|left|right|  {left}|  {right}|
+---+----+-----+--------+---------+
{code}

I have a gut feeling this is related to these issues:
https://issues.apache.org/jira/browse/SPARK-39376
https://issues.apache.org/jira/browse/SPARK-34527
https://issues.apache.org/jira/browse/SPARK-38603

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
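For readers without a cluster at hand, the symmetric behaviour the reporter expects (the joining key dropped from the expanded struct on both sides, as in Spark 3.1) can be sketched in pure Python. The `full_outer_join` helper below is hypothetical and illustrative only; it is not Spark API:

```python
# Pure-Python sketch of a full outer join with *symmetric* star expansion:
# the joining key is excluded from both expanded structs, matching the
# pre-3.2.0 output shown above. Not Spark code; for illustration only.

def full_outer_join(left, right, key):
    """Full outer join of two lists of dicts on `key`."""
    keys = {row[key] for row in left} | {row[key] for row in right}
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    out = []
    for k in sorted(keys):
        l = left_by_key.get(k)
        r = right_by_key.get(k)
        # Symmetric expansion: drop the joining key on BOTH sides.
        left_all = {c: v for c, v in (l or {}).items() if c != key}
        right_all = {c: v for c, v in (r or {}).items() if c != key}
        out.append({"id": k,
                    "left_all": left_all or None,
                    "right_all": right_all or None})
    return out

left = [{"id": i, "val": "left"} for i in range(5)]
right = [{"id": i, "val": "right"} for i in range(3, 7)]
rows = full_outer_join(left, right, "id")
```

With this definition, both `left_all` and `right_all` contain only `val`, never `id`, regardless of which side the row came from.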
[jira] [Created] (SPARK-40148) Make pyspark.sql.window examples self-contained
Hyukjin Kwon created SPARK-40148: Summary: Make pyspark.sql.window examples self-contained Key: SPARK-40148 URL: https://issues.apache.org/jira/browse/SPARK-40148 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Affects Versions: 3.4.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40098) Format error messages in the Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-40098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40098. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37520 [https://github.com/apache/spark/pull/37520] > Format error messages in the Thrift Server > -- > > Key: SPARK-40098 > URL: https://issues.apache.org/jira/browse/SPARK-40098 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > # Introduce a config to control the format of error messages: plain text and > JSON > # Modify the Thrift Server to output errors from Spark SQL according to the > config -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40138) Implement DataFrame.mode
[ https://issues.apache.org/jira/browse/SPARK-40138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40138. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37569 [https://github.com/apache/spark/pull/37569] > Implement DataFrame.mode > > > Key: SPARK-40138 > URL: https://issues.apache.org/jira/browse/SPARK-40138 > Project: Spark > Issue Type: Improvement > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40138) Implement DataFrame.mode
[ https://issues.apache.org/jira/browse/SPARK-40138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40138: Assignee: Ruifeng Zheng > Implement DataFrame.mode > > > Key: SPARK-40138 > URL: https://issues.apache.org/jira/browse/SPARK-40138 > Project: Spark > Issue Type: Improvement > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40142) Make pyspark.sql.functions examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40142: Assignee: (was: Apache Spark) > Make pyspark.sql.functions examples self-contained > -- > > Key: SPARK-40142 > URL: https://issues.apache.org/jira/browse/SPARK-40142 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40142) Make pyspark.sql.functions examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40142: Assignee: Apache Spark > Make pyspark.sql.functions examples self-contained > -- > > Key: SPARK-40142 > URL: https://issues.apache.org/jira/browse/SPARK-40142 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40142) Make pyspark.sql.functions examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40142: - Fix Version/s: (was: 3.4.0) > Make pyspark.sql.functions examples self-contained > -- > > Key: SPARK-40142 > URL: https://issues.apache.org/jira/browse/SPARK-40142 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40142) Make pyspark.sql.functions examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40142: Assignee: (was: Hyukjin Kwon) > Make pyspark.sql.functions examples self-contained > -- > > Key: SPARK-40142 > URL: https://issues.apache.org/jira/browse/SPARK-40142 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40147) Make pyspark.sql.session examples self-contained
Hyukjin Kwon created SPARK-40147: Summary: Make pyspark.sql.session examples self-contained Key: SPARK-40147 URL: https://issues.apache.org/jira/browse/SPARK-40147 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Affects Versions: 3.4.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-40142) Make pyspark.sql.functions examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-40142: -- > Make pyspark.sql.functions examples self-contained > -- > > Key: SPARK-40142 > URL: https://issues.apache.org/jira/browse/SPARK-40142 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40142) Make pyspark.sql.functions examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40142: Assignee: Hyukjin Kwon > Make pyspark.sql.functions examples self-contained > -- > > Key: SPARK-40142 > URL: https://issues.apache.org/jira/browse/SPARK-40142 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40142) Make pyspark.sql.functions examples self-contained
[ https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40142. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37575 [https://github.com/apache/spark/pull/37575] > Make pyspark.sql.functions examples self-contained > -- > > Key: SPARK-40142 > URL: https://issues.apache.org/jira/browse/SPARK-40142 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40146) Simply the codegen of getting map value
[ https://issues.apache.org/jira/browse/SPARK-40146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581718#comment-17581718 ] Apache Spark commented on SPARK-40146: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/37580 > Simply the codegen of getting map value > --- > > Key: SPARK-40146 > URL: https://issues.apache.org/jira/browse/SPARK-40146 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40146) Simply the codegen of getting map value
[ https://issues.apache.org/jira/browse/SPARK-40146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581716#comment-17581716 ] Apache Spark commented on SPARK-40146: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/37580 > Simply the codegen of getting map value > --- > > Key: SPARK-40146 > URL: https://issues.apache.org/jira/browse/SPARK-40146 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40146) Simply the codegen of getting map value
[ https://issues.apache.org/jira/browse/SPARK-40146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40146: Assignee: Apache Spark (was: Gengliang Wang) > Simply the codegen of getting map value > --- > > Key: SPARK-40146 > URL: https://issues.apache.org/jira/browse/SPARK-40146 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40146) Simply the codegen of getting map value
[ https://issues.apache.org/jira/browse/SPARK-40146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40146: Assignee: Gengliang Wang (was: Apache Spark) > Simply the codegen of getting map value > --- > > Key: SPARK-40146 > URL: https://issues.apache.org/jira/browse/SPARK-40146 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40146) Simply the codegen of getting map value
[ https://issues.apache.org/jira/browse/SPARK-40146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-40146: --- Issue Type: Improvement (was: Bug) > Simply the codegen of getting map value > --- > > Key: SPARK-40146 > URL: https://issues.apache.org/jira/browse/SPARK-40146 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40146) Simply the codegen of getting map value
Gengliang Wang created SPARK-40146: -- Summary: Simply the codegen of getting map value Key: SPARK-40146 URL: https://issues.apache.org/jira/browse/SPARK-40146 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Gengliang Wang Assignee: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40140) REST API for SQL level information does not show information on running queries
[ https://issues.apache.org/jira/browse/SPARK-40140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581708#comment-17581708 ] Yeachan Park commented on SPARK-40140: -- Please feel free to pick it up :) > REST API for SQL level information does not show information on running > queries > --- > > Key: SPARK-40140 > URL: https://issues.apache.org/jira/browse/SPARK-40140 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yeachan Park >Priority: Minor > > Hi All, > We noticed that the SQL information REST API implemented in > https://issues.apache.org/jira/browse/SPARK-27142 does not return SQL > queries that are currently running; we can only see queries that have > completed or failed. > As far as I can see, this should be supported, since one of the fields in the > returned JSON is "runningJobIds". -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40145) Create infra image when cut down branches
[ https://issues.apache.org/jira/browse/SPARK-40145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581701#comment-17581701 ] Apache Spark commented on SPARK-40145: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37579 > Create infra image when cut down branches > - > > Key: SPARK-40145 > URL: https://issues.apache.org/jira/browse/SPARK-40145 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40145) Create infra image when cut down branches
[ https://issues.apache.org/jira/browse/SPARK-40145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40145: Assignee: (was: Apache Spark) > Create infra image when cut down branches > - > > Key: SPARK-40145 > URL: https://issues.apache.org/jira/browse/SPARK-40145 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40145) Create infra image when cut down branches
[ https://issues.apache.org/jira/browse/SPARK-40145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581699#comment-17581699 ] Apache Spark commented on SPARK-40145: -- User 'Yikun' has created a pull request for this issue: https://github.com/apache/spark/pull/37579 > Create infra image when cut down branches > - > > Key: SPARK-40145 > URL: https://issues.apache.org/jira/browse/SPARK-40145 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39993) Spark on Kubernetes doesn't filter data by date
[ https://issues.apache.org/jira/browse/SPARK-39993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581700#comment-17581700 ] Hanna Liashchuk commented on SPARK-39993: - Could you try running it in client mode? Because that's exactly what is happening here: Jupyterhub runs in client mode. And yes, I run df.show() first to ensure that the df contains data; that's in the snippet too.

> Spark on Kubernetes doesn't filter data by date
> ---
>
> Key: SPARK-39993
> URL: https://issues.apache.org/jira/browse/SPARK-39993
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 3.2.2
> Environment: Kubernetes v1.23.6
> Spark 3.2.2
> Java 1.8.0_312
> Python 3.9.13
> Aws dependencies:
> aws-java-sdk-bundle-1.11.901.jar and hadoop-aws-3.3.1.jar
> Reporter: Hanna Liashchuk
> Priority: Major
> Labels: kubernetes
>
> I'm creating a Dataset with a column of type date and saving it into s3. When I read it
> back and try to use a where() clause, I've noticed it doesn't return data even
> though the data is there.
> Below is the code snippet I'm running:
>
> {code:java}
> from pyspark.sql.types import Row
> from pyspark.sql.functions import *
> ds = spark.range(10).withColumn("date", lit("2022-01-01")).withColumn("date",
> col("date").cast("date"))
> ds.where("date = '2022-01-01'").show()
> ds.write.mode("overwrite").parquet("s3a://bucket/test")
> df = spark.read.format("parquet").load("s3a://bucket/test")
> df.where("date = '2022-01-01'").show()
> {code}
> The first show() returns data, while the second one does not.
> I've noticed that it is related to the Kubernetes master, as the same code snippet
> works fine with master "local".
> UPD: if the column is used as a partition and has the type "date", there is no
> filtering problem.

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40145) Create infra image when cut down branches
[ https://issues.apache.org/jira/browse/SPARK-40145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40145: Assignee: Apache Spark > Create infra image when cut down branches > - > > Key: SPARK-40145 > URL: https://issues.apache.org/jira/browse/SPARK-40145 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Yikun Jiang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org