[jira] [Assigned] (SPARK-43076) Removing the dependency on `grpcio` when remote session is not used.
[ https://issues.apache.org/jira/browse/SPARK-43076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43076: Assignee: Haejoon Lee > Removing the dependency on `grpcio` when remote session is not used. > > > Key: SPARK-43076 > URL: https://issues.apache.org/jira/browse/SPARK-43076 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark > Affects Versions: 3.5.0 > Reporter: Haejoon Lee > Assignee: Haejoon Lee > Priority: Major > > We should not require `grpcio` to be installed when a remote session is not used for the pandas API on Spark. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
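The change the ticket describes is typically implemented with a lazy, guarded import: the optional dependency is only imported on the code path that needs it. A minimal sketch of that pattern (function and return values here are illustrative, not the actual PySpark code):

```python
def create_session(remote: bool = False):
    """Illustrative session factory, not the real PySpark API.

    `grpcio` (imported as `grpc`) is only required when a remote
    Spark Connect session is actually requested.
    """
    if remote:
        try:
            # The `grpcio` distribution exposes the `grpc` module.
            import grpc  # noqa: F401
        except ImportError as exc:
            raise ImportError(
                "grpcio is required for remote sessions; "
                "install it with `pip install grpcio`"
            ) from exc
        return "remote-session"
    # Local sessions never touch grpc, so grpcio need not be installed.
    return "local-session"
```

With this shape, importing and using the local path works on a machine without `grpcio`, which is the behavior the ticket asks for.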
[jira] [Resolved] (SPARK-43076) Removing the dependency on `grpcio` when remote session is not used.
[ https://issues.apache.org/jira/browse/SPARK-43076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43076. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40722 [https://github.com/apache/spark/pull/40722] > Removing the dependency on `grpcio` when remote session is not used. > > > Key: SPARK-43076 > URL: https://issues.apache.org/jira/browse/SPARK-43076 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark > Affects Versions: 3.5.0 > Reporter: Haejoon Lee > Assignee: Haejoon Lee > Priority: Major > Fix For: 3.5.0 > > > We should not require `grpcio` to be installed when a remote session is not used for the pandas API on Spark.
[jira] [Assigned] (SPARK-42992) Introduce PySparkRuntimeError
[ https://issues.apache.org/jira/browse/SPARK-42992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-42992: - Assignee: Haejoon Lee > Introduce PySparkRuntimeError > - > > Key: SPARK-42992 > URL: https://issues.apache.org/jira/browse/SPARK-42992 > Project: Spark > Issue Type: Sub-task > Components: PySpark > Affects Versions: 3.5.0 > Reporter: Haejoon Lee > Assignee: Haejoon Lee > Priority: Major > > Introduce PySparkRuntimeError to cover RuntimeError in a PySpark-specific way.
[jira] [Resolved] (SPARK-42992) Introduce PySparkRuntimeError
[ https://issues.apache.org/jira/browse/SPARK-42992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-42992. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40617 [https://github.com/apache/spark/pull/40617] > Introduce PySparkRuntimeError > - > > Key: SPARK-42992 > URL: https://issues.apache.org/jira/browse/SPARK-42992 > Project: Spark > Issue Type: Sub-task > Components: PySpark > Affects Versions: 3.5.0 > Reporter: Haejoon Lee > Assignee: Haejoon Lee > Priority: Major > Fix For: 3.5.0 > > > Introduce PySparkRuntimeError to cover RuntimeError in a PySpark-specific way.
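A minimal sketch of the kind of error-class structure the ticket describes — a PySpark-specific exception that is still catchable as the built-in RuntimeError. Class names here mirror the ticket, but fields and rendering are illustrative; the real implementation lives in `pyspark.errors`:

```python
class PySparkException(Exception):
    """Illustrative base class carrying an error class and message parameters."""

    def __init__(self, error_class: str, message_parameters: dict):
        self.error_class = error_class
        self.message_parameters = message_parameters
        # A real implementation would render a message template from a
        # registry of error classes; here we interpolate directly.
        super().__init__(f"[{error_class}] {message_parameters}")

    def getErrorClass(self) -> str:
        return self.error_class


class PySparkRuntimeError(PySparkException, RuntimeError):
    """PySpark-specific counterpart of the built-in RuntimeError."""
```

Because `PySparkRuntimeError` subclasses both, existing `except RuntimeError:` handlers keep working while new code can catch `PySparkException` and inspect the error class.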
[jira] [Resolved] (SPARK-43275) Migrate Spark Connect GroupedData error into error class
[ https://issues.apache.org/jira/browse/SPARK-43275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-43275. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40938 [https://github.com/apache/spark/pull/40938] > Migrate Spark Connect GroupedData error into error class > > > Key: SPARK-43275 > URL: https://issues.apache.org/jira/browse/SPARK-43275 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark > Affects Versions: 3.5.0 > Reporter: Haejoon Lee > Assignee: Haejoon Lee > Priority: Major > Fix For: 3.5.0 > > > Migrate Spark Connect GroupedData error into error class
[jira] [Assigned] (SPARK-43275) Migrate Spark Connect GroupedData error into error class
[ https://issues.apache.org/jira/browse/SPARK-43275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-43275: - Assignee: Haejoon Lee > Migrate Spark Connect GroupedData error into error class > > > Key: SPARK-43275 > URL: https://issues.apache.org/jira/browse/SPARK-43275 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark > Affects Versions: 3.5.0 > Reporter: Haejoon Lee > Assignee: Haejoon Lee > Priority: Major > > Migrate Spark Connect GroupedData error into error class
[jira] [Resolved] (SPARK-43274) Introduce `PySparkNotImplementedError`
[ https://issues.apache.org/jira/browse/SPARK-43274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-43274. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40938 [https://github.com/apache/spark/pull/40938] > Introduce `PySparkNotImplementedError` > > > Key: SPARK-43274 > URL: https://issues.apache.org/jira/browse/SPARK-43274 > Project: Spark > Issue Type: Sub-task > Components: PySpark > Affects Versions: 3.5.0 > Reporter: Haejoon Lee > Assignee: Haejoon Lee > Priority: Major > Fix For: 3.5.0 > > > Introduce `PySparkNotImplementedError`, corresponding to the built-in `NotImplementedError`.
[jira] [Assigned] (SPARK-43274) Introduce `PySparkNotImplementedError`
[ https://issues.apache.org/jira/browse/SPARK-43274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-43274: - Assignee: Haejoon Lee > Introduce `PySparkNotImplementedError` > > > Key: SPARK-43274 > URL: https://issues.apache.org/jira/browse/SPARK-43274 > Project: Spark > Issue Type: Sub-task > Components: PySpark > Affects Versions: 3.5.0 > Reporter: Haejoon Lee > Assignee: Haejoon Lee > Priority: Major > > Introduce `PySparkNotImplementedError`, corresponding to the built-in `NotImplementedError`.
[jira] [Updated] (SPARK-43291) Match behavior for DataFrame.cov on string DataFrame
[ https://issues.apache.org/jira/browse/SPARK-43291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43291: Summary: Match behavior for DataFrame.cov on string DataFrame (was: Re-enable test for DataFrame.cov on string DataFrame.) > Match behavior for DataFrame.cov on string DataFrame > > > Key: SPARK-43291 > URL: https://issues.apache.org/jira/browse/SPARK-43291 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark > Affects Versions: 3.5.0 > Reporter: Haejoon Lee > Priority: Major > > Should enable the test below: > {code:python} > pdf = pd.DataFrame([("1", "2"), ("0", "3"), ("2", "0"), ("1", "1")], > columns=["a", "b"]) > psdf = ps.from_pandas(pdf) > self.assert_eq(pdf.cov(), psdf.cov()) {code}
[jira] [Created] (SPARK-43291) Re-enable test for DataFrame.cov on string DataFrame.
Haejoon Lee created SPARK-43291: --- Summary: Re-enable test for DataFrame.cov on string DataFrame. Key: SPARK-43291 URL: https://issues.apache.org/jira/browse/SPARK-43291 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee Should enable the test below: {code:python} pdf = pd.DataFrame([("1", "2"), ("0", "3"), ("2", "0"), ("1", "1")], columns=["a", "b"]) psdf = ps.from_pandas(pdf) self.assert_eq(pdf.cov(), psdf.cov()) {code}
[jira] [Created] (SPARK-43290) Support IV and AAD optional parameters for aes_encrypt
Steve Weis created SPARK-43290: -- Summary: Support IV and AAD optional parameters for aes_encrypt Key: SPARK-43290 URL: https://issues.apache.org/jira/browse/SPARK-43290 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Steve Weis There are some use cases where callers of aes_encrypt may want to provide initialization vectors (IVs) or additional authenticated data (AAD). The most common cases are: 1. Ensuring that ciphertext matches values that have been encrypted by external tools. In those cases, the caller needs to provide an identical IV value. 2. For AES-CBC mode, there are some cases where callers want to generate deterministic encrypted output. 3. For AES-GCM mode, providing AAD fields allows callers to bind additional data to an encrypted ciphertext so that it can only be decrypted by a caller providing the same value. This is often used to enforce some context. The proposed new API is the following: * aes_encrypt(expr, key [, mode [, padding [, iv [, aad]]]]) * aes_decrypt(expr, key [, mode [, padding [, aad]]]) These fields are only supported for specific modes: * ECB: Does not support either IV or AAD and will return an error if either is provided. * CBC: Only supports an IV and will return an error if an AAD is provided. * GCM: Supports an IV, an AAD, or both. If a caller provides only an AAD in GCM mode, they need to pass a null value in the IV field.
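The per-mode rules in the proposal form a small validation table. A pure-Python sketch of that contract (this mirrors the rules as stated in the ticket; it is not Spark's actual implementation, and the function name is illustrative):

```python
from typing import Optional

# Which optional parameters each cipher mode accepts, per the proposal:
#   ECB: neither IV nor AAD; CBC: IV only; GCM: IV, AAD, or both.
MODE_RULES = {
    "ECB": {"iv": False, "aad": False},
    "CBC": {"iv": True, "aad": False},
    "GCM": {"iv": True, "aad": True},
}


def validate_aes_args(mode: str,
                      iv: Optional[bytes] = None,
                      aad: Optional[bytes] = None) -> None:
    """Raise ValueError if the mode does not accept a supplied IV/AAD."""
    rules = MODE_RULES[mode.upper()]
    if iv is not None and not rules["iv"]:
        raise ValueError(f"{mode} mode does not support an IV")
    if aad is not None and not rules["aad"]:
        raise ValueError(f"{mode} mode does not support AAD")
```

For example, `validate_aes_args("CBC", aad=b"ctx")` raises, while `validate_aes_args("GCM", iv=b"\x00" * 12, aad=b"ctx")` passes, matching the mode table above.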
[jira] [Assigned] (SPARK-43156) Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
[ https://issues.apache.org/jira/browse/SPARK-43156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-43156: --- Assignee: Jack Chen > Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null` > > > Key: SPARK-43156 > URL: https://issues.apache.org/jira/browse/SPARK-43156 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 3.4.0 > Reporter: Jack Chen > Assignee: Jack Chen > Priority: Major > > Example query: > {code:scala} > spark.sql("select *, (select (count(1)) is null from t1 where t0.a = t1.c) from t0").collect() > res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false]) > {code} > In this subquery, count(1) always evaluates to a non-null integer value, so count(1) is null is always false. The correct evaluation of the subquery is always false. > We incorrectly evaluate it to null for empty groups. The reason is that NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] [false] - this rewrite would be correct normally, but in the context of a scalar subquery it breaks our count bug handling in RewriteCorrelatedScalarSubquery.constructLeftJoins. By the time we get there, the query appears to not have the count bug - it looks the same as if the original query had a subquery with select any_value(false) from r..., and that case is _not_ subject to the count bug. > > Postgres comparison shows the correct always-false result: > [http://sqlfiddle.com/#!17/67822/5] > DDL for the example: > {code:sql} > create or replace temp view t0 (a, b) > as values > (1, 1.0), > (2, 2.0); > create or replace temp view t1 (c, d) > as values > (2, 3.0); {code}
[jira] [Resolved] (SPARK-43156) Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
[ https://issues.apache.org/jira/browse/SPARK-43156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-43156. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40946 [https://github.com/apache/spark/pull/40946] > Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null` > > > Key: SPARK-43156 > URL: https://issues.apache.org/jira/browse/SPARK-43156 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 3.4.0 > Reporter: Jack Chen > Assignee: Jack Chen > Priority: Major > Fix For: 3.5.0 > > > Example query: > {code:scala} > spark.sql("select *, (select (count(1)) is null from t1 where t0.a = t1.c) from t0").collect() > res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false]) > {code} > In this subquery, count(1) always evaluates to a non-null integer value, so count(1) is null is always false. The correct evaluation of the subquery is always false. > We incorrectly evaluate it to null for empty groups. The reason is that NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] [false] - this rewrite would be correct normally, but in the context of a scalar subquery it breaks our count bug handling in RewriteCorrelatedScalarSubquery.constructLeftJoins. By the time we get there, the query appears to not have the count bug - it looks the same as if the original query had a subquery with select any_value(false) from r..., and that case is _not_ subject to the count bug. > > Postgres comparison shows the correct always-false result: > [http://sqlfiddle.com/#!17/67822/5] > DDL for the example: > {code:sql} > create or replace temp view t0 (a, b) > as values > (1, 1.0), > (2, 2.0); > create or replace temp view t1 (c, d) > as values > (2, 3.0); {code}
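The two evaluation strategies from the ticket can be mimicked in plain Python over the t0/t1 data from the DDL: correct semantics evaluate `count(1) is null` per outer row (always false, since COUNT over an empty group is 0, not NULL), while the broken left-join decorrelation yields NULL for empty groups. This is a hypothetical simulation of the described behavior, not Spark code:

```python
# Data from the ticket's DDL.
t0 = [(1, 1.0), (2, 2.0)]
t1 = [(2, 3.0)]


def correct(a):
    # COUNT over the (possibly empty) correlated group is always a
    # non-null integer, so `count(1) IS NULL` is always false.
    cnt = sum(1 for (c, _) in t1 if c == a)
    return cnt is None  # always False


def buggy(a):
    # After NullPropagation rewrites isnull(count(1)) to the literal
    # `false`, the left-join-based decorrelation returns NULL for outer
    # rows with no matching group instead of evaluating the aggregate.
    matched = [c for (c, _) in t1 if c == a]
    return False if matched else None


correct_rows = [(a, b, correct(a)) for (a, b) in t0]
buggy_rows = [(a, b, buggy(a)) for (a, b) in t0]
# correct_rows gives False for both outer rows; buggy_rows gives None
# for (1, 1.0), matching the reported Array([1,1.0,null], [2,2.0,false]).
```

Row (1, 1.0) has no match in t1, so only the buggy path produces a null there, reproducing the discrepancy against the Postgres result.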
[jira] [Resolved] (SPARK-43276) Migrate Spark Connect Window errors into error class
[ https://issues.apache.org/jira/browse/SPARK-43276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-43276. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40939 [https://github.com/apache/spark/pull/40939] > Migrate Spark Connect Window errors into error class > > > Key: SPARK-43276 > URL: https://issues.apache.org/jira/browse/SPARK-43276 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark > Affects Versions: 3.5.0 > Reporter: Haejoon Lee > Assignee: Haejoon Lee > Priority: Major > Fix For: 3.5.0 > > > Migrate Spark Connect Window errors into error class
[jira] [Assigned] (SPARK-43276) Migrate Spark Connect Window errors into error class
[ https://issues.apache.org/jira/browse/SPARK-43276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-43276: - Assignee: Haejoon Lee > Migrate Spark Connect Window errors into error class > > > Key: SPARK-43276 > URL: https://issues.apache.org/jira/browse/SPARK-43276 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark > Affects Versions: 3.5.0 > Reporter: Haejoon Lee > Assignee: Haejoon Lee > Priority: Major > > Migrate Spark Connect Window errors into error class
[jira] [Assigned] (SPARK-43289) PySpark UDF supports python package dependencies
[ https://issues.apache.org/jira/browse/SPARK-43289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu reassigned SPARK-43289: -- Assignee: Weichen Xu > PySpark UDF supports python package dependencies > > > Key: SPARK-43289 > URL: https://issues.apache.org/jira/browse/SPARK-43289 > Project: Spark > Issue Type: New Feature > Components: Connect, ML, PySpark > Affects Versions: 3.5.0 > Reporter: Weichen Xu > Assignee: Weichen Xu > Priority: Major > > h3. Requirements > > Make PySpark UDFs support annotating Python dependencies; when a UDF executes, the UDF worker creates a new Python environment with the provided dependencies. > h3. Motivation > > We have two major cases: > > * For the Spark Connect case, the client Python environment is very likely to differ from the PySpark server-side Python environment, which causes the user's UDF to fail on the server side. > * Some third-party machine-learning libraries (e.g. MLflow) require PySpark UDFs to support dependencies, because in ML cases we need to run model inference via a PySpark UDF in exactly the same Python environment that trained the model. Currently MLflow supports this by creating a child Python process in the PySpark UDF worker and redirecting all UDF input data to the child process for inference, which causes significant overhead. If PySpark UDFs supported built-in Python dependency management, such a poorly performing approach would be unnecessary. > > h3. Proposed API > ``` > @pandas_udf("string", pip_requirements=...) > ``` > The `pip_requirements` argument is either an iterable of pip requirement strings (e.g. ``["scikit-learn", "-r /path/to/req2.txt", "-c /path/to/constraints.txt"]``) or the string path to a pip requirements file on the local filesystem (e.g. ``"/path/to/requirements.txt"``), representing the pip requirements for the Python UDF.
[jira] [Created] (SPARK-43289) PySpark UDF supports python package dependencies
Weichen Xu created SPARK-43289: -- Summary: PySpark UDF supports python package dependencies Key: SPARK-43289 URL: https://issues.apache.org/jira/browse/SPARK-43289 Project: Spark Issue Type: New Feature Components: Connect, ML, PySpark Affects Versions: 3.5.0 Reporter: Weichen Xu h3. Requirements Make PySpark UDFs support annotating Python dependencies; when a UDF executes, the UDF worker creates a new Python environment with the provided dependencies. h3. Motivation We have two major cases: * For the Spark Connect case, the client Python environment is very likely to differ from the PySpark server-side Python environment, which causes the user's UDF to fail on the server side. * Some third-party machine-learning libraries (e.g. MLflow) require PySpark UDFs to support dependencies, because in ML cases we need to run model inference via a PySpark UDF in exactly the same Python environment that trained the model. Currently MLflow supports this by creating a child Python process in the PySpark UDF worker and redirecting all UDF input data to the child process for inference, which causes significant overhead. If PySpark UDFs supported built-in Python dependency management, such a poorly performing approach would be unnecessary. h3. Proposed API ``` @pandas_udf("string", pip_requirements=...) ``` The `pip_requirements` argument is either an iterable of pip requirement strings (e.g. ``["scikit-learn", "-r /path/to/req2.txt", "-c /path/to/constraints.txt"]``) or the string path to a pip requirements file on the local filesystem (e.g. ``"/path/to/requirements.txt"``), representing the pip requirements for the Python UDF.
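Since the proposed `pip_requirements` accepts two shapes (an iterable of requirement strings or a path to a requirements file), an implementation would need to normalize them. A sketch of that normalization step (illustrative only; the API above is a proposal and this helper is not part of PySpark):

```python
from typing import Iterable, List, Union


def resolve_pip_requirements(pip_requirements: Union[str, Iterable[str]]) -> List[str]:
    """Normalize the proposed argument into a flat list of pip requirement strings."""
    if isinstance(pip_requirements, str):
        # A string is interpreted as the path to a pip requirements file
        # on the local filesystem, e.g. "/path/to/requirements.txt".
        with open(pip_requirements) as f:
            return [line.strip() for line in f
                    if line.strip() and not line.startswith("#")]
    # Otherwise it is an iterable of requirement strings, which may include
    # pip flags such as "-r /path/to/req2.txt" or "-c /path/to/constraints.txt".
    return list(pip_requirements)
```

The UDF worker could then hand the resolved list to pip when building the per-UDF Python environment.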
[jira] [Resolved] (SPARK-43277) Clean up deprecation hadoop api usage in Yarn module
[ https://issues.apache.org/jira/browse/SPARK-43277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-43277. -- Fix Version/s: 3.5.0 Assignee: Yang Jie Resolution: Fixed Resolved by https://github.com/apache/spark/pull/40940 > Clean up deprecation hadoop api usage in Yarn module > > > Key: SPARK-43277 > URL: https://issues.apache.org/jira/browse/SPARK-43277 > Project: Spark > Issue Type: Sub-task > Components: YARN > Affects Versions: 3.5.0 > Reporter: Yang Jie > Assignee: Yang Jie > Priority: Minor > Fix For: 3.5.0 > >
[jira] [Created] (SPARK-43288) DataSourceV2: CREATE TABLE LIKE
John Zhuge created SPARK-43288: -- Summary: DataSourceV2: CREATE TABLE LIKE Key: SPARK-43288 URL: https://issues.apache.org/jira/browse/SPARK-43288 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.4.0 Reporter: John Zhuge Support CREATE TABLE LIKE in DSv2.
[jira] [Updated] (SPARK-43020) Refactoring similar error classes such as `NOT_XXX`.
[ https://issues.apache.org/jira/browse/SPARK-43020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43020: Description: We should consolidate the error classes that have similar error messages into a single error class, or classify them into a main/sub error class structure. NOTE: This refactoring should start after all other initial migration is done. was: We'd better add a main error class for type errors and switch the type-related errors into sub-error classes. > Refactoring similar error classes such as `NOT_XXX`. > - > > Key: SPARK-43020 > URL: https://issues.apache.org/jira/browse/SPARK-43020 > Project: Spark > Issue Type: Sub-task > Components: PySpark > Affects Versions: 3.5.0 > Reporter: Haejoon Lee > Priority: Major > > We should consolidate the error classes that have similar error messages into a single error class, or classify them into a main/sub error class structure. > NOTE: This refactoring should start after all other initial migration is done.
[jira] [Updated] (SPARK-43020) Refactoring similar error classes such as `NOT_XXX`.
[ https://issues.apache.org/jira/browse/SPARK-43020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43020: Description: We should consolidate the error classes that have similar error messages into a single error class, or classify them into a main/sub error class structure. *NOTE:* This refactoring should start after all other initial migration is done. was: We should consolidate the error classes that have similar error messages into a single error class, or classify them into a main/sub error class structure. NOTE: This refactoring should start after all other initial migration is done. > Refactoring similar error classes such as `NOT_XXX`. > - > > Key: SPARK-43020 > URL: https://issues.apache.org/jira/browse/SPARK-43020 > Project: Spark > Issue Type: Sub-task > Components: PySpark > Affects Versions: 3.5.0 > Reporter: Haejoon Lee > Priority: Major > > We should consolidate the error classes that have similar error messages into a single error class, or classify them into a main/sub error class structure. > *NOTE:* This refactoring should start after all other initial migration is done.
[jira] [Updated] (SPARK-43020) Refactoring similar error classes such as `NOT_XXX`.
[ https://issues.apache.org/jira/browse/SPARK-43020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-43020: Summary: Refactoring similar error classes such as `NOT_XXX`. (was: Add main error class for type errors) > Refactoring similar error classes such as `NOT_XXX`. > - > > Key: SPARK-43020 > URL: https://issues.apache.org/jira/browse/SPARK-43020 > Project: Spark > Issue Type: Sub-task > Components: PySpark > Affects Versions: 3.5.0 > Reporter: Haejoon Lee > Priority: Major > > We'd better add a main error class for type errors and switch the type-related errors into sub-error classes.
[jira] [Updated] (SPARK-43280) Reimplement the protobuf breaking change checker script
[ https://issues.apache.org/jira/browse/SPARK-43280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-43280: -- Summary: Reimplement the protobuf breaking change checker script (was: Improve the protobuf breaking change checker script) > Reimplement the protobuf breaking change checker script > --- > > Key: SPARK-43280 > URL: https://issues.apache.org/jira/browse/SPARK-43280 > Project: Spark > Issue Type: Improvement > Components: Build > Affects Versions: 3.5.0 > Reporter: Ruifeng Zheng > Priority: Major >
[jira] [Assigned] (SPARK-43136) Scala mapGroup, coGroup
[ https://issues.apache.org/jira/browse/SPARK-43136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell reassigned SPARK-43136: - Assignee: Zhen Li > Scala mapGroup, coGroup > --- > > Key: SPARK-43136 > URL: https://issues.apache.org/jira/browse/SPARK-43136 > Project: Spark > Issue Type: Improvement > Components: Connect > Affects Versions: 3.5.0 > Reporter: Zhen Li > Assignee: Zhen Li > Priority: Major > > Add basic Dataset#groupByKey -> KeyValueGroupedDataset support
[jira] [Resolved] (SPARK-43136) Scala mapGroup, coGroup
[ https://issues.apache.org/jira/browse/SPARK-43136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-43136. --- Fix Version/s: 3.5.0 Resolution: Fixed > Scala mapGroup, coGroup > --- > > Key: SPARK-43136 > URL: https://issues.apache.org/jira/browse/SPARK-43136 > Project: Spark > Issue Type: Improvement > Components: Connect > Affects Versions: 3.5.0 > Reporter: Zhen Li > Assignee: Zhen Li > Priority: Major > Fix For: 3.5.0 > > > Add basic Dataset#groupByKey -> KeyValueGroupedDataset support
[jira] [Updated] (SPARK-43287) Connect JVM client REPL not correctly shut down if killed
[ https://issues.apache.org/jira/browse/SPARK-43287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Liu updated SPARK-43287: Description: How to reproduce: # Start a Scala client: `./connector/connect/bin/spark-connect-scala-client` # In another terminal, kill the process: `kill ` # Back in the client terminal, you can't see anything you type, but the commands still work: {code:java} Spark session available as 'spark'. [Spark ASCII banner] wei.liu:~/oss-spark$ CONTRIBUTING.md appveyor.yml conf examples logs resource-managers target LICENSE artifacts connector graphx mllib sbin tools LICENSE-binary assembly core hadoop-cloud mllib-local scalastyle-config.xml NOTICE bin data hs_err_pid9062.log pom.xml scalastyle-on-compile.generated.xml NOTICE-binary binder dependency-reduced-pom.xml launcher project spark-warehouse R build dev licenses python sql README.md common docs licenses-binary repl streaming wei.liu:~/oss-spark$ wei.liu:~/oss-spark$ wei.liu:~/oss-spark$ {code} I ran 'ls' above and pressed return multiple times. (was: the same description with the reporter's hostname still visible in some prompts) > Connect JVM client REPL not correctly shut down if killed > > > Key: SPARK-43287 > URL: https://issues.apache.org/jira/browse/SPARK-43287 > Project: Spark > Issue Type: Bug > Components: Connect > Affects Versions: 3.5.0 > Reporter: Wei Liu > Priority: Major > > How to reproduce: > # Start a Scala client: `./connector/connect/bin/spark-connect-scala-client` > # In another terminal, kill the process: `kill ` > # Back in the client terminal, you can't see anything you type, but the commands still work.
[jira] [Updated] (SPARK-43287) Connect JVM client REPL not correctly shut down if killed
[ https://issues.apache.org/jira/browse/SPARK-43287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Liu updated SPARK-43287:
----------------------------
Description:
How to reproduce:
# Start a scala client `./connector/connect/bin/spark-connect-scala-client`
# in another terminal, kill the process `kill `
# Back to the client terminal, you can't see anything you type, but the command still works

{code:java}
Spark session available as 'spark'.
[flattened Spark ASCII banner and `ls` directory listing omitted; prompt shown as wei.liu:~/oss-spark$]
{code}
I ran 'ls' above, and clicked return multiple times

was: the same description, with the terminal transcript showing the full prompt wei.liu@ip-10-110-19-234:~/oss-spark$
[jira] [Updated] (SPARK-43287) Connect JVM client REPL not correctly shut down if killed
[ https://issues.apache.org/jira/browse/SPARK-43287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Liu updated SPARK-43287:
----------------------------
Description:
How to reproduce:
# Start a scala client `./connector/connect/bin/spark-connect-scala-client`
# in another terminal, kill the process `kill `
# Back to the client terminal, you can't see anything you type, but the command still works

{code:java}
Spark session available as 'spark'.
[flattened Spark ASCII banner and `ls` directory listing omitted; prompt shown as wei.liu@ip-10-110-19-234:~/oss-spark$]
{code}
I ran 'ls' above, and clicked return multiple times

was: the same description, with the transcript wrapped in ``` fences and Jira bold markup instead of a {code} block
[jira] [Created] (SPARK-43287) Connect JVM client REPL not correctly shut down if killed
Wei Liu created SPARK-43287:
----------------------------
Summary: Connect JVM client REPL not correctly shut down if killed
Key: SPARK-43287
URL: https://issues.apache.org/jira/browse/SPARK-43287
Project: Spark
Issue Type: Bug
Components: Connect
Affects Versions: 3.5.0
Reporter: Wei Liu

How to reproduce:
# Start a scala client `./connector/connect/bin/spark-connect-scala-client`
# in another terminal, kill the process `kill `
# Back to the client terminal, you can't see anything you type, but the command still works

```
Spark session available as 'spark'.
[flattened Spark ASCII banner and `ls` directory listing omitted]
```
I ran 'ls' above, and clicked return multiple times

--
This message was sent by Atlassian Jira (v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
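The symptom in the report (typed input no longer echoed after the server process is killed) is what a terminal left in raw mode looks like. As a hedged illustration only — the actual client is a Scala/Ammonite REPL, and `install_terminal_restore` is a hypothetical helper, not anything in Spark — a client could snapshot the terminal settings at startup and restore them on exit:

```python
import atexit
import sys

def install_terminal_restore(stream=None):
    """Best-effort: restore cooked terminal mode when the process exits.

    Returns False when the stream is not a real tty (nothing to restore),
    True once a restore handler has been registered.
    """
    stream = stream if stream is not None else sys.stdin
    if not stream.isatty():
        return False
    import termios  # POSIX-only; imported lazily so non-tty callers never need it
    fd = stream.fileno()
    saved = termios.tcgetattr(fd)  # snapshot the settings we started with

    def restore():
        # Put the terminal back the way we found it, even if the
        # remote side was killed out from under the REPL.
        termios.tcsetattr(fd, termios.TCSADRAIN, saved)

    atexit.register(restore)
    return True
```

A fix along these lines would also need the REPL loop to detect the broken connection and actually exit, so the `atexit` handler runs.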
[jira] [Resolved] (SPARK-43285) ReplE2ESuite consistently fails with JDK 17
[ https://issues.apache.org/jira/browse/SPARK-43285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-43285. --- Fix Version/s: 3.5.0 Resolution: Fixed > ReplE2ESuite consistently fails with JDK 17 > --- > > Key: SPARK-43285 > URL: https://issues.apache.org/jira/browse/SPARK-43285 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Venkata Sai Akhil Gudesa >Assignee: Venkata Sai Akhil Gudesa >Priority: Major > Fix For: 3.5.0 > > > [[Comment|https://github.com/apache/spark/pull/40675#discussion_r1174696470] > from [~gurwls223]] > This test consistently fails with JDK 17: > {code:java} > [info] ReplE2ESuite: > [info] - Simple query *** FAILED *** (10 seconds, 4 milliseconds) > [info] java.lang.RuntimeException: REPL Timed out while running command: > [info] spark.sql("select 1").collect() > [info] > [info] Console output: > [info] Error output: Compiling (synthetic)/ammonite/predef/ArgsPredef.sc > [info] at > org.apache.spark.sql.application.ReplE2ESuite.runCommandsInShell(ReplE2ESuite.scala:87) > [info] at > org.apache.spark.sql.application.ReplE2ESuite.$anonfun$new$1(ReplE2ESuite.scala:102) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > [info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196) > [info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195) > [info] at > org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564) > [info] at > 
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224){code} > [https://github.com/apache/spark/actions/runs/4780630672/jobs/8498505928#step:9:4647] > [https://github.com/apache/spark/actions/runs/4774942961/jobs/8488946907] > [https://github.com/apache/spark/actions/runs/4769162286/jobs/8479293802] > [https://github.com/apache/spark/actions/runs/4759278349/jobs/8458399201] > [https://github.com/apache/spark/actions/runs/4748319019/jobs/8434392414] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43285) ReplE2ESuite consistently fails with JDK 17
[ https://issues.apache.org/jira/browse/SPARK-43285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell reassigned SPARK-43285: - Assignee: Venkata Sai Akhil Gudesa > ReplE2ESuite consistently fails with JDK 17 > --- > > Key: SPARK-43285 > URL: https://issues.apache.org/jira/browse/SPARK-43285 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Venkata Sai Akhil Gudesa >Assignee: Venkata Sai Akhil Gudesa >Priority: Major > > [[Comment|https://github.com/apache/spark/pull/40675#discussion_r1174696470] > from [~gurwls223]] > This test consistently fails with JDK 17: > {code:java} > [info] ReplE2ESuite: > [info] - Simple query *** FAILED *** (10 seconds, 4 milliseconds) > [info] java.lang.RuntimeException: REPL Timed out while running command: > [info] spark.sql("select 1").collect() > [info] > [info] Console output: > [info] Error output: Compiling (synthetic)/ammonite/predef/ArgsPredef.sc > [info] at > org.apache.spark.sql.application.ReplE2ESuite.runCommandsInShell(ReplE2ESuite.scala:87) > [info] at > org.apache.spark.sql.application.ReplE2ESuite.$anonfun$new$1(ReplE2ESuite.scala:102) > [info] at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > [info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196) > [info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195) > [info] at > org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564) > [info] at > 
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224){code} > [https://github.com/apache/spark/actions/runs/4780630672/jobs/8498505928#step:9:4647] > [https://github.com/apache/spark/actions/runs/4774942961/jobs/8488946907] > [https://github.com/apache/spark/actions/runs/4769162286/jobs/8479293802] > [https://github.com/apache/spark/actions/runs/4759278349/jobs/8458399201] > [https://github.com/apache/spark/actions/runs/4748319019/jobs/8434392414] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43286) Update CBC mode in aes_encrypt()/aes_decrypt() to not use KDF
Steve Weis created SPARK-43286: -- Summary: Update CBC mode in aes_encrypt()/aes_decrypt() to not use KDF Key: SPARK-43286 URL: https://issues.apache.org/jira/browse/SPARK-43286 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Steve Weis The current implementation of AES-CBC mode called via `{{{}aes_encrypt{}}}` and `{{{}aes_decrypt{}}}` uses a key derivation function (KDF) based on OpenSSL's [EVP_BytesToKey|https://www.openssl.org/docs/man3.0/man3/EVP_BytesToKey.html]. This is intended for generating keys based on passwords and OpenSSL's documents discourage its use: _"Newer applications should use a more modern algorithm"._ `{{{}aes_encrypt{}}}` and `{{{}aes_decrypt{}}}` should use the key directly in CBC mode, as it does for both GCM and ECB mode. The output should then be the initialization vector (IV) prepended to the ciphertext – as is done with GCM mode: {{(16-byte randomly generated IV | AES-CBC encrypted ciphertext)}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
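The proposed output framing (not the AES transform itself) can be sketched in a few lines. This is a minimal illustration of the `(16-byte IV | ciphertext)` layout described in the ticket; the helper names are made up, and a stand-in byte string takes the place of real AES-CBC ciphertext:

```python
import os

IV_LEN = 16  # AES block size; CBC uses one random 16-byte IV per message

def pack_cbc_output(iv: bytes, ciphertext: bytes) -> bytes:
    """Frame the output as the IV prepended to the ciphertext."""
    if len(iv) != IV_LEN:
        raise ValueError("CBC IV must be exactly 16 bytes")
    return iv + ciphertext

def unpack_cbc_output(blob: bytes) -> tuple:
    """Split the leading IV back off so decryption can initialize the cipher."""
    return blob[:IV_LEN], blob[IV_LEN:]

iv = os.urandom(IV_LEN)
ciphertext = b"\xaa" * 32  # stand-in bytes; real ciphertext comes from AES-CBC
blob = pack_cbc_output(iv, ciphertext)
assert unpack_cbc_output(blob) == (iv, ciphertext)
```

Because the IV travels with the message, no key-derivation step is needed: the receiver uses the key directly, mirroring how GCM mode already works.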
[jira] [Updated] (SPARK-43284) _metadata.file_path regression
[ https://issues.apache.org/jira/browse/SPARK-43284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Lewis updated SPARK-43284: Summary: _metadata.file_path regression (was: _metadata.file_path) > _metadata.file_path regression > -- > > Key: SPARK-43284 > URL: https://issues.apache.org/jira/browse/SPARK-43284 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: David Lewis >Priority: Major > > As part of the [SparkPath > refactor](https://issues.apache.org/jira/browse/SPARK-41970) the behavior of > `_metadata.file_path` was inadvertently changed. In Spark 3.4+ it now returns > a non-encoded path string, as opposed to a url-encoded path string. > This ticket is to fix that regression. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43285) ReplE2ESuite consistently fails with JDK 17
[ https://issues.apache.org/jira/browse/SPARK-43285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Venkata Sai Akhil Gudesa updated SPARK-43285: - Description: [[Comment|https://github.com/apache/spark/pull/40675#discussion_r1174696470] from [~gurwls223]] This test consistently fails with JDK 17: {code:java} [info] ReplE2ESuite: [info] - Simple query *** FAILED *** (10 seconds, 4 milliseconds) [info] java.lang.RuntimeException: REPL Timed out while running command: [info] spark.sql("select 1").collect() [info] [info] Console output: [info] Error output: Compiling (synthetic)/ammonite/predef/ArgsPredef.sc [info] at org.apache.spark.sql.application.ReplE2ESuite.runCommandsInShell(ReplE2ESuite.scala:87) [info] at org.apache.spark.sql.application.ReplE2ESuite.$anonfun$new$1(ReplE2ESuite.scala:102) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) [info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196) [info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195) [info] at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564) [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224){code} [https://github.com/apache/spark/actions/runs/4780630672/jobs/8498505928#step:9:4647] [https://github.com/apache/spark/actions/runs/4774942961/jobs/8488946907] [https://github.com/apache/spark/actions/runs/4769162286/jobs/8479293802] [https://github.com/apache/spark/actions/runs/4759278349/jobs/8458399201] 
[https://github.com/apache/spark/actions/runs/4748319019/jobs/8434392414] was: [[Comment|https://github.com/apache/spark/pull/40675#discussion_r1174696470] from [~gurwls223]] This test consistently fails with JDK 17: [info] ReplE2ESuite: [info] - Simple query *** FAILED *** (10 seconds, 4 milliseconds) [info] java.lang.RuntimeException: REPL Timed out while running command: [info] spark.sql("select 1").collect() [info] [info] Console output: [info] Error output: Compiling (synthetic)/ammonite/predef/ArgsPredef.sc [info] at org.apache.spark.sql.application.ReplE2ESuite.runCommandsInShell(ReplE2ESuite.scala:87) [info] at org.apache.spark.sql.application.ReplE2ESuite.$anonfun$new$1(ReplE2ESuite.scala:102) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) [info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196) [info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195) [info] at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564) [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) [https://github.com/apache/spark/actions/runs/4780630672/jobs/8498505928#step:9:4647] [https://github.com/apache/spark/actions/runs/4774942961/jobs/8488946907] [https://github.com/apache/spark/actions/runs/4769162286/jobs/8479293802] [https://github.com/apache/spark/actions/runs/4759278349/jobs/8458399201] [https://github.com/apache/spark/actions/runs/4748319019/jobs/8434392414] > ReplE2ESuite consistently fails with JDK 17 > --- > > Key: SPARK-43285 > URL: 
https://issues.apache.org/jira/browse/SPARK-43285 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.5.0 >Reporter: Venkata Sai Akhil Gudesa >Priority: Major > > [[Comment|https://github.com/apache/spark/pull/40675#discussion_r1174696470] > from [~gurwls223]] > This test consistently fails with JDK 17: > {code:java} > [info] ReplE2ESuite: > [info] - Simple query *** FAILED *** (10 seconds, 4 milliseconds) > [info] java.lang.RuntimeException: REPL Timed out while running command: > [info] spark.sql("select 1").collect() > [info] > [info] Console output: > [info] Error output: Compiling (synthetic)/ammonite/predef/ArgsPredef.sc > [info] at > org.apache.spark.sql.application.ReplE2ESuite.runCommandsInShell(ReplE2ESuite.scala:87) > [info] at > org.apache.spark.sql.application.ReplE2ES
[jira] [Created] (SPARK-43285) ReplE2ESuite consistently fails with JDK 17
Venkata Sai Akhil Gudesa created SPARK-43285: Summary: ReplE2ESuite consistently fails with JDK 17 Key: SPARK-43285 URL: https://issues.apache.org/jira/browse/SPARK-43285 Project: Spark Issue Type: Bug Components: Connect Affects Versions: 3.5.0 Reporter: Venkata Sai Akhil Gudesa [[Comment|https://github.com/apache/spark/pull/40675#discussion_r1174696470] from [~gurwls223]] This test consistently fails with JDK 17: [info] ReplE2ESuite: [info] - Simple query *** FAILED *** (10 seconds, 4 milliseconds) [info] java.lang.RuntimeException: REPL Timed out while running command: [info] spark.sql("select 1").collect() [info] [info] Console output: [info] Error output: Compiling (synthetic)/ammonite/predef/ArgsPredef.sc [info] at org.apache.spark.sql.application.ReplE2ESuite.runCommandsInShell(ReplE2ESuite.scala:87) [info] at org.apache.spark.sql.application.ReplE2ESuite.$anonfun$new$1(ReplE2ESuite.scala:102) [info] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) [info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196) [info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195) [info] at org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564) [info] at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) [https://github.com/apache/spark/actions/runs/4780630672/jobs/8498505928#step:9:4647] [https://github.com/apache/spark/actions/runs/4774942961/jobs/8488946907] [https://github.com/apache/spark/actions/runs/4769162286/jobs/8479293802] 
[https://github.com/apache/spark/actions/runs/4759278349/jobs/8458399201] [https://github.com/apache/spark/actions/runs/4748319019/jobs/8434392414] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43284) _metadata.file_path
David Lewis created SPARK-43284: --- Summary: _metadata.file_path Key: SPARK-43284 URL: https://issues.apache.org/jira/browse/SPARK-43284 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.0 Reporter: David Lewis As part of the [SparkPath refactor](https://issues.apache.org/jira/browse/SPARK-41970) the behavior of `_metadata.file_path` was inadvertently changed. In Spark 3.4+ it now returns a non-encoded path string, as opposed to a url-encoded path string. This ticket is to fix that regression. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43283) _metadata.file_path returns unescaled URLs
David Lewis created SPARK-43283: --- Summary: _metadata.file_path returns unescaled URLs Key: SPARK-43283 URL: https://issues.apache.org/jira/browse/SPARK-43283 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.0 Reporter: David Lewis As part of https://issues.apache.org/jira/browse/SPARK-41970 we changed the encoding of the string returned by `_metadata.file_path` from url-encoded to hadoop-path encoded (i.e. not encoded). This ticket is to undo that behavior change. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
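The behavior change in these two tickets is easiest to see on a path containing characters that URL encoding escapes. A small stdlib illustration (the path is made up, and this uses Python's `urllib.parse`, not Spark's `_metadata` column):

```python
from urllib.parse import quote, unquote

raw_path = "/warehouse/part 1/name=a b.parquet"  # hypothetical path with spaces
url_encoded = quote(raw_path)  # the pre-3.4 style: percent-encoded

print(url_encoded)   # /warehouse/part%201/name%3Da%20b.parquet
assert unquote(url_encoded) == raw_path  # round-trips back to the raw form
```

Spark 3.4+ inadvertently started returning the `raw_path` form; the fix restores the `url_encoded` form so callers comparing against stored URL strings keep working.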
[jira] [Commented] (SPARK-43156) Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null`
[ https://issues.apache.org/jira/browse/SPARK-43156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716312#comment-17716312 ] Ignite TC Bot commented on SPARK-43156: --- User 'jchen5' has created a pull request for this issue: https://github.com/apache/spark/pull/40946 > Correctness COUNT bug in correlated scalar subselect with `COUNT(*) is null` > > > Key: SPARK-43156 > URL: https://issues.apache.org/jira/browse/SPARK-43156 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jack Chen >Priority: Major > > Example query: > {code:java} > spark.sql("select *, (select (count(1)) is null from t1 where t0.a = t1.c) > from t0").collect() > res6: Array[org.apache.spark.sql.Row] = Array([1,1.0,null], [2,2.0,false]) > {code} > In this subquery, count(1) always evaluates to a non-null integer value, so > count(1) is null is always false. The correct evaluation of the subquery is > always false. > We incorrectly evaluate it to null for empty groups. The reason is that > NullPropagation rewrites Aggregate [c] [isnull(count(1))] to Aggregate [c] > [false] - this rewrite would be correct normally, but in the context of a > scalar subquery it breaks our count bug handling in > RewriteCorrelatedScalarSubquery.constructLeftJoins . By the time we get > there, the query appears to not have the count bug - it looks the same as if > the original query had a subquery with select any_value(false) from r..., and > that case is _not_ subject to the count bug. 
> > Postgres comparison show correct always-false result: > [http://sqlfiddle.com/#!17/67822/5] > DDL for the example: > {code:java} > create or replace temp view t0 (a, b) > as values > (1, 1.0), > (2, 2.0); > create or replace temp view t1 (c, d) > as values > (2, 3.0); {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
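The always-false claim can be checked mechanically against the ticket's DDL. This is a plain-Python re-evaluation of the subquery semantics over the same rows (illustration only, not Spark code):

```python
# Data from the ticket's DDL
t0 = [(1, 1.0), (2, 2.0)]
t1 = [(2, 3.0)]

def subquery_is_null(a):
    # select (count(1)) is null from t1 where t0.a = t1.c
    cnt = sum(1 for (c, d) in t1 if c == a)  # COUNT over the (possibly empty) group
    return cnt is None  # COUNT never yields NULL, so this is always False

expected = [(a, b, subquery_is_null(a)) for (a, b) in t0]
assert expected == [(1, 1.0, False), (2, 2.0, False)]
```

Spark's buggy rewrite instead returns NULL for the empty group (the `a = 1` row), because `NullPropagation` replaces `isnull(count(1))` with `false` before the count-bug handling can add its null-to-zero default.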
[jira] [Commented] (SPARK-43098) Should not handle the COUNT bug when the GROUP BY clause of a correlated scalar subquery is non-empty
[ https://issues.apache.org/jira/browse/SPARK-43098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716313#comment-17716313 ] Ignite TC Bot commented on SPARK-43098: --- User 'jchen5' has created a pull request for this issue: https://github.com/apache/spark/pull/40946 > Should not handle the COUNT bug when the GROUP BY clause of a correlated > scalar subquery is non-empty > - > > Key: SPARK-43098 > URL: https://issues.apache.org/jira/browse/SPARK-43098 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Jack Chen >Assignee: Jack Chen >Priority: Major > Fix For: 3.4.1, 3.5.0 > > > From [~allisonwang-db] : > There is no COUNT bug when the correlated equality predicates are also in the > group by clause. However, the current logic to handle the COUNT bug still > adds default aggregate function value and returns incorrect results. > > {code:java} > create view t1(c1, c2) as values (0, 1), (1, 2); > create view t2(c1, c2) as values (0, 2), (0, 3); > select c1, c2, (select count(*) from t2 where t1.c1 = t2.c1 group by c1) from > t1; > -- Correct answer: [(0, 1, 2), (1, 2, null)] > +---+---+--+ > |c1 |c2 |scalarsubquery(c1)| > +---+---+--+ > |0 |1 |2 | > |1 |2 |0 | > +---+---+--+ > {code} > > This bug affects scalar subqueries in RewriteCorrelatedScalarSubquery, but > lateral subqueries handle it correctly in DecorrelateInnerQuery. Related: > https://issues.apache.org/jira/browse/SPARK-36113 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
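The key point above — with a non-empty GROUP BY, an empty group yields no row (hence NULL), rather than the defaulted 0 the count-bug handling would inject — can be simulated in plain Python over the ticket's data (illustration only, not Spark internals):

```python
t1 = [(0, 1), (1, 2)]
t2 = [(0, 2), (0, 3)]

def grouped_count(c1):
    # select count(*) from t2 where t1.c1 = t2.c1 group by c1
    rows = [r for r in t2 if r[0] == c1]
    # With a GROUP BY, an empty group produces no output row at all,
    # so the scalar subquery evaluates to NULL -- not a defaulted 0.
    return len(rows) if rows else None

result = [(c1, c2, grouped_count(c1)) for (c1, c2) in t1]
assert result == [(0, 1, 2), (1, 2, None)]  # matches the ticket's correct answer
```

Spark's rewrite wrongly applies the count-bug default here and reports 0 for the `c1 = 1` row, which is the incorrect output shown in the ticket.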
[jira] [Commented] (SPARK-36112) Enable DecorrelateInnerQuery for IN/EXISTS subqueries
[ https://issues.apache.org/jira/browse/SPARK-36112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716289#comment-17716289 ] Jia Fan commented on SPARK-36112: - Hi, [~allisonwang-db] I checked the code. Seem like the only work is change the code in `PullupCorrelatedPredicates`. Just make sure Exists invoke `decorrelate`. !image-2023-04-25-21-51-55-961.png|width=617,height=275! Because `DecorrelateInnerQuery` already support Filter in subQuery. And Exists also be supported in `RewritePredicateSubquery`. Should I change just one line? Or is there something else I don't understand? > Enable DecorrelateInnerQuery for IN/EXISTS subqueries > - > > Key: SPARK-36112 > URL: https://issues.apache.org/jira/browse/SPARK-36112 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Priority: Major > Attachments: image-2023-04-25-21-51-55-961.png > > > Currently, `DecorrelateInnerQuery` is only enabled for scalar and lateral > subqueries. We should enable `DecorrelateInnerQuery` for IN/EXISTS > subqueries. Note we need to add the logic to rewrite domain joins in > `RewritePredicateSubquery`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43225) Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution
[ https://issues.apache.org/jira/browse/SPARK-43225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-43225: - Issue Type: Improvement (was: Bug) > Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution > -- > > Key: SPARK-43225 > URL: https://issues.apache.org/jira/browse/SPARK-43225 > Project: Spark > Issue Type: Improvement > Components: Build, SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Minor > > To fix CVE issue: https://github.com/apache/spark/security/dependabot/50 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43225) Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution
[ https://issues.apache.org/jira/browse/SPARK-43225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-43225: - Priority: Minor (was: Major) > Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution > -- > > Key: SPARK-43225 > URL: https://issues.apache.org/jira/browse/SPARK-43225 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Minor > > To fix CVE issue: https://github.com/apache/spark/security/dependabot/50 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43225) Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution
[ https://issues.apache.org/jira/browse/SPARK-43225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-43225. -- Fix Version/s: 3.5.0 Assignee: Yuming Wang Resolution: Fixed Resolved by https://github.com/apache/spark/pull/40893 > Remove jackson-core-asl and jackson-mapper-asl from pre-built distribution > -- > > Key: SPARK-43225 > URL: https://issues.apache.org/jira/browse/SPARK-43225 > Project: Spark > Issue Type: Improvement > Components: Build, SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Minor > Fix For: 3.5.0 > > > To fix CVE issue: https://github.com/apache/spark/security/dependabot/50 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42798) Upgrade protobuf-java to 3.22.2
[ https://issues.apache.org/jira/browse/SPARK-42798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-42798: - Priority: Minor (was: Major) > Upgrade protobuf-java to 3.22.2 > --- > > Key: SPARK-42798 > URL: https://issues.apache.org/jira/browse/SPARK-42798 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor > > * [https://github.com/protocolbuffers/protobuf/releases/tag/v22.1] > * [https://github.com/protocolbuffers/protobuf/releases/tag/v22.2] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42798) Upgrade protobuf-java to 3.22.2
[ https://issues.apache.org/jira/browse/SPARK-42798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-42798. -- Fix Version/s: 3.5.0 Assignee: Yang Jie Resolution: Fixed Resolved by https://github.com/apache/spark/pull/40430 > Upgrade protobuf-java to 3.22.2 > --- > > Key: SPARK-42798 > URL: https://issues.apache.org/jira/browse/SPARK-42798 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.5.0 > > > * [https://github.com/protocolbuffers/protobuf/releases/tag/v22.1] > * [https://github.com/protocolbuffers/protobuf/releases/tag/v22.2] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36112) Enable DecorrelateInnerQuery for IN/EXISTS subqueries
[ https://issues.apache.org/jira/browse/SPARK-36112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jia Fan updated SPARK-36112: Attachment: image-2023-04-25-21-51-55-961.png > Enable DecorrelateInnerQuery for IN/EXISTS subqueries > - > > Key: SPARK-36112 > URL: https://issues.apache.org/jira/browse/SPARK-36112 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Allison Wang >Priority: Major > Attachments: image-2023-04-25-21-51-55-961.png > > > Currently, `DecorrelateInnerQuery` is only enabled for scalar and lateral > subqueries. We should enable `DecorrelateInnerQuery` for IN/EXISTS > subqueries. Note we need to add the logic to rewrite domain joins in > `RewritePredicateSubquery`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-39753) Broadcast joins should pushdown join constraints as Filter to the larger relation
[ https://issues.apache.org/jira/browse/SPARK-39753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716268#comment-17716268 ] Jean-Christophe Lefebvre edited comment on SPARK-39753 at 4/25/23 1:40 PM: --- Any development on this ticket? was (Author: JIRAUSER300051): Any developpement on this ticket? > Broadcast joins should pushdown join constraints as Filter to the larger > relation > - > > Key: SPARK-39753 > URL: https://issues.apache.org/jira/browse/SPARK-39753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0 >Reporter: Victor Delépine >Priority: Major > > SPARK-19609 was bulk-closed a while ago, but not fixed. I've decided to > re-open it here for more visibility, since I believe this bug has a major > impact and that fixing it could drastically improve the performance of many > pipelines. > Allow me to paste the initial description again here: > _For broadcast inner-joins, where the smaller relation is known to be small > enough to materialize on a worker, the set of values for all join columns is > known and fits in memory. Spark should translate these values into a > {{Filter}} pushed down to the datasource. The common join condition of > equality, i.e. {{{}lhs.a == rhs.a{}}}, can be written as an {{a in ...}} > clause. An example of pushing such filters is already present in the form of > {{IsNotNull}} filters via_ [~sameerag]{_}'s work on SPARK-12957 subtasks.{_} > _This optimization could even work when the smaller relation does not fit > entirely in memory. This could be done by partitioning the smaller relation > into N pieces, applying this predicate pushdown for each piece, and unioning > the results._ > > Essentially, when doing a Broadcast join, the smaller side can be used to > filter down the bigger side before performing the join. 
As of today, the join > will read all partitions of the bigger side, without pruning partitions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39753) Broadcast joins should pushdown join constraints as Filter to the larger relation
[ https://issues.apache.org/jira/browse/SPARK-39753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716268#comment-17716268 ] Jean-Christophe Lefebvre commented on SPARK-39753: -- Any development on this ticket? > Broadcast joins should pushdown join constraints as Filter to the larger > relation > - > > Key: SPARK-39753 > URL: https://issues.apache.org/jira/browse/SPARK-39753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0 >Reporter: Victor Delépine >Priority: Major > > SPARK-19609 was bulk-closed a while ago, but not fixed. I've decided to > re-open it here for more visibility, since I believe this bug has a major > impact and that fixing it could drastically improve the performance of many > pipelines. > Allow me to paste the initial description again here: > _For broadcast inner-joins, where the smaller relation is known to be small > enough to materialize on a worker, the set of values for all join columns is > known and fits in memory. Spark should translate these values into a > {{Filter}} pushed down to the datasource. The common join condition of > equality, i.e. {{{}lhs.a == rhs.a{}}}, can be written as an {{a in ...}} > clause. An example of pushing such filters is already present in the form of > {{IsNotNull}} filters via_ [~sameerag]{_}'s work on SPARK-12957 subtasks.{_} > _This optimization could even work when the smaller relation does not fit > entirely in memory. This could be done by partitioning the smaller relation > into N pieces, applying this predicate pushdown for each piece, and unioning > the results._ > > Essentially, when doing a Broadcast join, the smaller side can be used to > filter down the bigger side before performing the join. 
As of today, the join > will read all partitions of the bigger side, without pruning partitions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
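The pushdown this ticket asks for can be sketched in plain Python (the relations, key name, and helper below are illustrative, not Spark's actual implementation): materialize the broadcast side's join keys as a set and filter the large side with it before the join, which is what translating the join condition into an `a in ...` filter achieves.

```python
# Sketch of the requested optimization, outside Spark: the small relation's
# join keys fit in memory, so they can prune the large side before joining.
def broadcast_join_with_pushdown(large, small, key):
    # Materialize the small side's key set (assumed to fit in memory).
    small_keys = {row[key] for row in small}

    # The "pushed-down filter": keep only large-side rows whose key is in
    # the set, instead of feeding every row/partition into the join.
    pruned = [row for row in large if row[key] in small_keys]

    # Ordinary hash join on the pruned input.
    index = {}
    for row in small:
        index.setdefault(row[key], []).append(row)
    return [{**l, **s} for l in pruned for s in index.get(l[key], [])]

large = [{"a": i, "x": i * 10} for i in range(1000)]
small = [{"a": 3, "y": "m"}, {"a": 7, "y": "n"}]
result = broadcast_join_with_pushdown(large, small, "a")
```

In Spark terms, the `pruned` step corresponds to a `Filter` (or partition pruning) applied at the datasource before the broadcast hash join runs.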
[jira] [Created] (SPARK-43282) Investigate DataFrame.sort_values with pandas behavior.
Haejoon Lee created SPARK-43282: --- Summary: Investigate DataFrame.sort_values with pandas behavior. Key: SPARK-43282 URL: https://issues.apache.org/jira/browse/SPARK-43282 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark Affects Versions: 3.5.0 Reporter: Haejoon Lee {code:python} import pandas as pd pdf = pd.DataFrame( { "a": pd.Categorical([1, 2, 3, 1, 2, 3]), "b": pd.Categorical( ["b", "a", "c", "c", "b", "a"], categories=["c", "b", "d", "a"] ), }, ) pdf.groupby("a").apply(lambda x: x).sort_values(["a"]) Traceback (most recent call last): ... ValueError: 'a' is both an index level and a column label, which is ambiguous. {code} We should investigate whether this is intended behavior or just a bug in pandas. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
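The ambiguity in the traceback can be illustrated without pandas. Below is a hypothetical mini-resolver mirroring the check pandas performs: once `a` appears both as an index level (from the `groupby("a").apply` result) and as a column, there is no unambiguous sort key, so raising may well be intended behavior rather than a bug.

```python
# Hypothetical sketch of the check behind the ValueError above: a sort key
# that names both an index level and a column cannot be resolved.
def resolve_sort_key(key, index_levels, columns):
    in_index = key in index_levels
    in_column = key in columns
    if in_index and in_column:
        raise ValueError(
            f"{key!r} is both an index level and a column label, "
            "which is ambiguous."
        )
    if not (in_index or in_column):
        raise KeyError(key)
    return "index" if in_index else "column"

# "a" survives as a column while the group keys also appear as an index
# level, which reproduces the ambiguity; "b" is unambiguous.
kind = resolve_sort_key("b", index_levels=["a"], columns=["a", "b"])
```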
[jira] [Comment Edited] (SPARK-43273) Spark can't read parquet files with a newer LZ4_RAW compression
[ https://issues.apache.org/jira/browse/SPARK-43273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716135#comment-17716135 ] Andrew Grigorev edited comment on SPARK-43273 at 4/25/23 12:56 PM: --- Just as icing on the cake, ClickHouse accidentally started to use LZ4_RAW by default for their Parquet output format :). https://github.com/ClickHouse/ClickHouse/issues/49141 was (Author: ei-grad): Just as a icing on the cake - Clickhouse accidently started to use LZ4_RAW by default for their Parquet output format :). > Spark can't read parquet files with a newer LZ4_RAW compression > --- > > Key: SPARK-43273 > URL: https://issues.apache.org/jira/browse/SPARK-43273 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.4, 3.3.3, 3.3.2, 3.4.0 >Reporter: Andrew Grigorev >Priority: Trivial > > hadoop-parquet version should be updated to 1.3.0 (together with other > parquet-mr libs) > {code:java} > java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job > aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent > failure: Lost task 2.0 in stage 1.0 (TID 3) (f2b63fdfa0a6 executor driver): > java.lang.IllegalArgumentException: No enum constant > org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW > at java.base/java.lang.Enum.valueOf(Enum.java:273) > at > org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26) > at > org.apache.parquet.format.converter.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:636) > ... {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43281) Fix concurrent writer does not update file metrics
XiDuo You created SPARK-43281: - Summary: Fix concurrent writer does not update file metrics Key: SPARK-43281 URL: https://issues.apache.org/jira/browse/SPARK-43281 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: XiDuo You The concurrent writer uses the temp file path to get the file status after the commit task. However, the temp file has already been moved to its new path during the commit task. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
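The failure mode is easy to reproduce with the standard library alone (paths and sizes below are made up; this is not the Spark writer code): once commit renames the temp file, metrics have to be read from the final path.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    tmp_path = os.path.join(d, "_temporary", "part-00000")
    final_path = os.path.join(d, "part-00000")
    os.makedirs(os.path.dirname(tmp_path))

    with open(tmp_path, "wb") as f:
        f.write(b"x" * 128)  # stand-in for the task's written data

    os.rename(tmp_path, final_path)  # what "commit task" does

    # Buggy approach: statting the temp path after commit fails, so file
    # metrics derived from it come out empty/wrong.
    try:
        os.stat(tmp_path)
        temp_path_still_valid = True
    except FileNotFoundError:
        temp_path_still_valid = False

    # Fix: derive the metrics from the committed path instead.
    committed_size = os.stat(final_path).st_size
```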
[jira] [Commented] (SPARK-43272) Replace reflection w/ direct calling for `SparkHadoopUtil#createFile`
[ https://issues.apache.org/jira/browse/SPARK-43272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716236#comment-17716236 ] Nikita Awasthi commented on SPARK-43272: User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/40945 > Replace reflection w/ direct calling for `SparkHadoopUtil#createFile` > -- > > Key: SPARK-43272 > URL: https://issues.apache.org/jira/browse/SPARK-43272 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43142) DSL expressions fail on attribute with special characters
[ https://issues.apache.org/jira/browse/SPARK-43142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-43142. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40902 [https://github.com/apache/spark/pull/40902] > DSL expressions fail on attribute with special characters > - > > Key: SPARK-43142 > URL: https://issues.apache.org/jira/browse/SPARK-43142 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Willi Raschkowski >Assignee: Willi Raschkowski >Priority: Major > Fix For: 3.5.0 > > > Expressions on implicitly converted attributes fail if the attributes have > names containing special characters. They fail even if the attributes are > backtick-quoted: > {code:java} > scala> import org.apache.spark.sql.catalyst.dsl.expressions._ > import org.apache.spark.sql.catalyst.dsl.expressions._ > scala> "`slashed/col`".attr > res0: org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute = > 'slashed/col > scala> "`slashed/col`".attr.asc > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '/' expecting {, '.', '-'}(line 1, pos 7) > == SQL == > slashed/col > ---^^^ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
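The underlying parsing problem — splitting a multipart attribute name while honoring backtick quoting — can be sketched generically (a simplified stand-in, not Spark's parser; it ignores escaped backticks): characters such as `/` inside backticks must stay part of one identifier instead of being re-parsed.

```python
# Simplified multipart-name splitter: `...`-quoted segments are opaque.
def parse_attribute_name(s):
    parts, i, n = [], 0, len(s)
    while i < n:
        if s[i] == "`":
            closing = s.index("`", i + 1)  # naive: no escaped backticks
            parts.append(s[i + 1 : closing])
            i = closing + 1
        else:
            dot = s.find(".", i)
            if dot == -1:
                parts.append(s[i:])
                i = n
            else:
                parts.append(s[i:dot])
                i = dot
        if i < n and s[i] == ".":
            i += 1  # skip the separator between name parts
    return parts
```

With this treatment, `"`slashed/col`"` resolves to a single name part, which is what the bug report expects of `.attr.asc`.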
[jira] [Updated] (SPARK-43273) Spark can't read parquet files with a newer LZ4_RAW compression
[ https://issues.apache.org/jira/browse/SPARK-43273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Grigorev updated SPARK-43273: Description: hadoop-parquet version should be updated to 1.3.0 (together with other parquet-mr libs) {code:java} java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent failure: Lost task 2.0 in stage 1.0 (TID 3) (f2b63fdfa0a6 executor driver): java.lang.IllegalArgumentException: No enum constant org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW at java.base/java.lang.Enum.valueOf(Enum.java:273) at org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26) at org.apache.parquet.format.converter.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:636) ... {code} was: hadoop-parquet version should be updated to 1.3.0 {code:java} java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent failure: Lost task 2.0 in stage 1.0 (TID 3) (f2b63fdfa0a6 executor driver): java.lang.IllegalArgumentException: No enum constant org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW at java.base/java.lang.Enum.valueOf(Enum.java:273) at org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26) at org.apache.parquet.format.converter.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:636) ... 
{code} > Spark can't read parquet files with a newer LZ4_RAW compression > --- > > Key: SPARK-43273 > URL: https://issues.apache.org/jira/browse/SPARK-43273 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.4, 3.3.3, 3.3.2, 3.4.0 >Reporter: Andrew Grigorev >Priority: Trivial > > hadoop-parquet version should be updated to 1.3.0 (together with other > parquet-mr libs) > {code:java} > java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job > aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent > failure: Lost task 2.0 in stage 1.0 (TID 3) (f2b63fdfa0a6 executor driver): > java.lang.IllegalArgumentException: No enum constant > org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW > at java.base/java.lang.Enum.valueOf(Enum.java:273) > at > org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26) > at > org.apache.parquet.format.converter.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:636) > ... {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
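The exception is a plain unknown-enum-member lookup, which can be mimicked in Python (the codec list here is a made-up stand-in for the older parquet-mr enum): a reader whose enum predates `LZ4_RAW` cannot map the name and fails exactly this way.

```python
from enum import Enum

class OldCompressionCodecName(Enum):
    # Hypothetical pre-upgrade codec enum that predates LZ4_RAW.
    UNCOMPRESSED = 0
    SNAPPY = 1
    GZIP = 2
    LZ4 = 3

def codec_from_name(name):
    try:
        return OldCompressionCodecName[name]
    except KeyError:
        # Mirrors "No enum constant ...CompressionCodecName.LZ4_RAW".
        raise ValueError(f"No enum constant CompressionCodecName.{name}") from None
```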
[jira] [Updated] (SPARK-43280) Improve the protobuf breaking change checker script
[ https://issues.apache.org/jira/browse/SPARK-43280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-43280: -- Summary: Improve the protobuf breaking change checker script (was: Improve the protobuf breaking change script) > Improve the protobuf breaking change checker script > --- > > Key: SPARK-43280 > URL: https://issues.apache.org/jira/browse/SPARK-43280 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43280) Improve the protobuf breaking change script
[ https://issues.apache.org/jira/browse/SPARK-43280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-43280: -- Priority: Major (was: Blocker) > Improve the protobuf breaking change script > --- > > Key: SPARK-43280 > URL: https://issues.apache.org/jira/browse/SPARK-43280 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43280) Improve the protobuf breaking change script
Ruifeng Zheng created SPARK-43280: - Summary: Improve the protobuf breaking change script Key: SPARK-43280 URL: https://issues.apache.org/jira/browse/SPARK-43280 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43279) Cleanup unused members from `SparkHadoopUtil`
Yang Jie created SPARK-43279: Summary: Cleanup unused members from `SparkHadoopUtil` Key: SPARK-43279 URL: https://issues.apache.org/jira/browse/SPARK-43279 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.5.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43278) Exception in thread "main" java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;
[ https://issues.apache.org/jira/browse/SPARK-43278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiangjiguang0719 updated SPARK-43278: - Description: Java version: 1.8.0_331, Apache Maven 3.8.4 I run next steps: # git clone [https://github.com/apache/spark.git] # git checkout -b v3.3.0 3.3.0 # mvn clean install -DskipTests # copy hive-site.xml to examples/src/main/resources/ # execute TPC-H Q6 {code:java} public static void main(String[] args) throws InterruptedException { SparkConf sparkConf = new SparkConf() .setAppName("demo") .setMaster("local[1]") ; SparkSession sparkSession = SparkSession.builder() .config(sparkConf) .enableHiveSupport() .getOrCreate(); sparkSession.sql("use local_tpch_sf10_uncompressed_etl"); sparkSession.sql(TPCH.SQL6).show(); } {code} get the error info: Exception in thread "main" java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer; at org.apache.spark.util.io.ChunkedByteBufferOutputStream.toChunkedByteBuffer(ChunkedByteBufferOutputStream.scala:115) at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:325) at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:140) at org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:95) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:75) at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1529) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.buildReaderWithPartitionValues(ParquetFileFormat.scala:235) at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:457) at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:448) at org.apache.spark.sql.execution.FileSourceScanExec.doExecuteColumnar(DataSourceScanExec.scala:547) at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeColumnar$1(SparkPlan.scala:221) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:232) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:229) at org.apache.spark.sql.execution.SparkPlan.executeColumnar(SparkPlan.scala:217) was: Java version: 1.8.0_331, Apache Maven 3.8.4 I run next steps: # git clone [https://github.com/apache/spark.git] # git checkout -b v3.3.0 3.3.0 # mvn clean install -DskipTests # copy hive-site.xml to examples/src/main/resources/ # execute TPC-H Q6 !image-2023-04-25-17-14-50-392.png|width=437,height=246! get the error info !image-2023-04-25-17-15-57-874.png|width=466,height=161! > Exception in thread "main" java.lang.NoSuchMethodError: > java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer; > --- > > Key: SPARK-43278 > URL: https://issues.apache.org/jira/browse/SPARK-43278 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.3.0 >Reporter: jiangjiguang0719 >Priority: Major > > Java version: 1.8.0_331, Apache Maven 3.8.4 > I run next steps: > # git clone [https://github.com/apache/spark.git] > # git checkout -b v3.3.0 3.3.0 > # mvn clean install -DskipTests > # copy hive-site.xml to examples/src/main/resources/ > # execute TPC-H Q6 > > {code:java} > public static void main(String[] args) throws InterruptedException { > SparkConf sparkConf = new SparkConf() > .setAppName("demo") > .setMaster("local[1]") > ; > SparkSession sparkSession = SparkSession.builder() > .config(sparkConf) > .enableHiveSupport() > .getOrCreate(); > sparkSession.sql("use local_tpch_sf10_uncompressed_etl"); > sparkSession.sql(TPCH.SQL6).show(); > } {code} > > > get the error info: > Exception in thread "main" java.lang.NoSuchMethodError: > java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer; > at > 
org.apache.spark.util.io.ChunkedByteBufferOutputStream.toChunkedByteBuffer(ChunkedByteBufferOutputStream.scala:115) > at > org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:325) > at > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:140) > at > org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:95) > at > org.apache.spark.br
[jira] [Created] (SPARK-43278) Exception in thread "main" java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer;
jiangjiguang0719 created SPARK-43278: Summary: Exception in thread "main" java.lang.NoSuchMethodError: java.nio.ByteBuffer.flip()Ljava/nio/ByteBuffer; Key: SPARK-43278 URL: https://issues.apache.org/jira/browse/SPARK-43278 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 3.3.0 Reporter: jiangjiguang0719 Java version: 1.8.0_331, Apache Maven 3.8.4 I run next steps: # git clone [https://github.com/apache/spark.git] # git checkout -b v3.3.0 3.3.0 # mvn clean install -DskipTests # copy hive-site.xml to examples/src/main/resources/ # execute TPC-H Q6 !image-2023-04-25-17-14-50-392.png|width=437,height=246! get the error info !image-2023-04-25-17-15-57-874.png|width=466,height=161! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42940) Session management support streaming connect
[ https://issues.apache.org/jira/browse/SPARK-42940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716181#comment-17716181 ] ASF GitHub Bot commented on SPARK-42940: User 'rangadi' has created a pull request for this issue: https://github.com/apache/spark/pull/40937 > Session management support streaming connect > > > Key: SPARK-42940 > URL: https://issues.apache.org/jira/browse/SPARK-42940 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Raghu Angadi >Priority: Major > > Add session support for streaming jobs. > E.g. a session should stay alive while a streaming job is alive. > This might differ in more complex scenarios, e.g. what happens when the client loses > track of the session. Such semantics would be handled as part of session > semantics across Spark Connect (including streaming). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43204) Align MERGE assignments with table attributes
[ https://issues.apache.org/jira/browse/SPARK-43204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-43204. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40919 [https://github.com/apache/spark/pull/40919] > Align MERGE assignments with table attributes > - > > Key: SPARK-43204 > URL: https://issues.apache.org/jira/browse/SPARK-43204 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > Fix For: 3.5.0 > > > Similar to SPARK-42151, we need to do the same for MERGE assignments. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43204) Align MERGE assignments with table attributes
[ https://issues.apache.org/jira/browse/SPARK-43204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-43204: --- Assignee: Anton Okolnychyi > Align MERGE assignments with table attributes > - > > Key: SPARK-43204 > URL: https://issues.apache.org/jira/browse/SPARK-43204 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > > Similar to SPARK-42151, we need to do the same for MERGE assignments. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43277) Clean up deprecation hadoop api usage in Yarn module
Yang Jie created SPARK-43277: Summary: Clean up deprecation hadoop api usage in Yarn module Key: SPARK-43277 URL: https://issues.apache.org/jira/browse/SPARK-43277 Project: Spark Issue Type: Sub-task Components: YARN Affects Versions: 3.5.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43226) Define extractors for file-constant metadata columns
[ https://issues.apache.org/jira/browse/SPARK-43226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-43226. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40885 [https://github.com/apache/spark/pull/40885] > Define extractors for file-constant metadata columns > > > Key: SPARK-43226 > URL: https://issues.apache.org/jira/browse/SPARK-43226 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Ryan Johnson >Assignee: Ryan Johnson >Priority: Major > Fix For: 3.5.0 > > > File-source constant metadata columns are often derived indirectly from > file-level metadata values rather than exposing those values directly. For > example, {{_metadata.file_name}} is currently hard-coded in > {{FileFormat.updateMetadataInternalRow}} as: > > {code:java} > UTF8String.fromString(filePath.getName){code} > > We should add support for metadata extractors, functions that map from > {{PartitionedFile}} to {{{}Literal{}}}, so that we can express such columns > in a generic way instead of hard-coding them. > We can't just add them to the metadata map because then they have to be > pre-computed even if it turns out the query does not select that field. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
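The extractor idea in the description can be sketched outside Spark (the `FileInfo` shape and field names here are hypothetical stand-ins for `PartitionedFile` and the `_metadata` fields): keep a function per metadata field, not a precomputed value, and call it only for the fields a query actually selects.

```python
from dataclasses import dataclass

@dataclass
class FileInfo:
    # Hypothetical stand-in for Spark's PartitionedFile.
    path: str
    size: int

# Each extractor maps a file descriptor to a value; nothing is computed
# until the extractor is actually invoked.
EXTRACTORS = {
    "file_name": lambda f: f.path.rsplit("/", 1)[-1],
    "file_path": lambda f: f.path,
    "file_size": lambda f: f.size,
}

def metadata_row(file, selected_fields):
    # Evaluate only the fields the query selected; unselected extractors
    # are never called, so they cost nothing.
    return {name: EXTRACTORS[name](file) for name in selected_fields}

f = FileInfo(path="/data/part-00000.parquet", size=2048)
row = metadata_row(f, ["file_name", "file_size"])
```

This is exactly why the description says the values can't simply live in the metadata map: a map forces eager computation, while a function per field defers it.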
[jira] [Assigned] (SPARK-43226) Define extractors for file-constant metadata columns
[ https://issues.apache.org/jira/browse/SPARK-43226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-43226: --- Assignee: Ryan Johnson > Define extractors for file-constant metadata columns > > > Key: SPARK-43226 > URL: https://issues.apache.org/jira/browse/SPARK-43226 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Ryan Johnson >Assignee: Ryan Johnson >Priority: Major > > File-source constant metadata columns are often derived indirectly from > file-level metadata values rather than exposing those values directly. For > example, {{_metadata.file_name}} is currently hard-coded in > {{FileFormat.updateMetadataInternalRow}} as: > > {code:java} > UTF8String.fromString(filePath.getName){code} > > We should add support for metadata extractors, functions that map from > {{PartitionedFile}} to {{{}Literal{}}}, so that we can express such columns > in a generic way instead of hard-coding them. > We can't just add them to the metadata map because then they have to be > pre-computed even if it turns out the query does not select that field. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43243) Add Level param to df.printSchema for Python API
[ https://issues.apache.org/jira/browse/SPARK-43243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-43243: - Assignee: Khalid Mammadov > Add Level param to df.printSchema for Python API > > > Key: SPARK-43243 > URL: https://issues.apache.org/jira/browse/SPARK-43243 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Khalid Mammadov >Assignee: Khalid Mammadov >Priority: Major > > The Python printSchema in the DataFrame API is missing the level parameter that is > available in the Scala API. This ticket adds it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43243) Add Level param to df.printSchema for Python API
[ https://issues.apache.org/jira/browse/SPARK-43243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-43243. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40916 [https://github.com/apache/spark/pull/40916] > Add Level param to df.printSchema for Python API > > > Key: SPARK-43243 > URL: https://issues.apache.org/jira/browse/SPARK-43243 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Khalid Mammadov >Assignee: Khalid Mammadov >Priority: Major > Fix For: 3.5.0 > > > The Python printSchema in the DataFrame API is missing the level parameter that is > available in the Scala API. This ticket adds it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
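What a `level` parameter buys can be sketched with a made-up nested-dict schema standing in for Spark's `StructType` (this is not the actual PySpark API): recursion into nested fields stops once the requested depth is reached.

```python
# Sketch of a depth-limited schema printer; level=None prints everything.
def print_schema(schema, level=None, depth=1):
    lines = []
    for name, dtype in schema.items():
        prefix = " " * 4 * (depth - 1) + f"|-- {name}: "
        if isinstance(dtype, dict):
            lines.append(prefix + "struct")
            # Only descend into the struct while we are above the limit.
            if level is None or depth < level:
                lines.extend(print_schema(dtype, level, depth + 1))
        else:
            lines.append(prefix + dtype)
    return lines

schema = {"id": "long", "info": {"name": "string", "address": {"city": "string"}}}
top_only = print_schema(schema, level=1)
```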
[jira] [Created] (SPARK-43276) Migrate Spark Connect Window errors into error class
Haejoon Lee created SPARK-43276: --- Summary: Migrate Spark Connect Window errors into error class Key: SPARK-43276 URL: https://issues.apache.org/jira/browse/SPARK-43276 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.5.0 Reporter: Haejoon Lee Migrate Spark Connect Window errors into error class -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43275) Migrate Spark Connect GroupedData error into error class
Haejoon Lee created SPARK-43275: --- Summary: Migrate Spark Connect GroupedData error into error class Key: SPARK-43275 URL: https://issues.apache.org/jira/browse/SPARK-43275 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.5.0 Reporter: Haejoon Lee Migrate Spark Connect GroupedData error into error class
[jira] [Created] (SPARK-43274) Introduce `PySparkNotImplementError`
Haejoon Lee created SPARK-43274: --- Summary: Introduce `PySparkNotImplementError` Key: SPARK-43274 URL: https://issues.apache.org/jira/browse/SPARK-43274 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Haejoon Lee Introduce `PySparkNotImplementError` corresponding to Python's built-in `NotImplementedError`
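The pattern behind these PySpark error-class tickets can be sketched as follows. This is an illustrative toy, not the real `pyspark.errors` implementation; the class layout and the `ERROR_CLASSES` table are assumptions made for the example:

```python
# Illustrative toy of a PySpark-style error hierarchy. The real classes live
# in pyspark.errors; this layout and the ERROR_CLASSES table are assumptions.

ERROR_CLASSES = {
    "NOT_IMPLEMENTED": "{feature} is not implemented.",
}

class PySparkException(Exception):
    """Base exception carrying a structured error class plus parameters."""
    def __init__(self, error_class, message_parameters):
        self.error_class = error_class
        self.message_parameters = message_parameters
        super().__init__(ERROR_CLASSES[error_class].format(**message_parameters))

class PySparkNotImplementedError(PySparkException, NotImplementedError):
    """Keeps `except NotImplementedError` working while adding an error class."""

try:
    raise PySparkNotImplementedError("NOT_IMPLEMENTED", {"feature": "DataFrame.foo"})
except NotImplementedError as e:
    print(e.error_class, "-", e)
```

Subclassing both the PySpark base and the built-in exception is the design point: existing `except NotImplementedError` handlers keep working, while callers can now match on a stable `error_class` string instead of parsing the message.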
[jira] [Resolved] (SPARK-43231) Reduce the memory requirement in torch-related tests
[ https://issues.apache.org/jira/browse/SPARK-43231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-43231. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40874 [https://github.com/apache/spark/pull/40874] > Reduce the memory requirement in torch-related tests > > > Key: SPARK-43231 > URL: https://issues.apache.org/jira/browse/SPARK-43231 > Project: Spark > Issue Type: Test > Components: Connect, ML, PySpark, Tests >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.5.0
[jira] [Assigned] (SPARK-43231) Reduce the memory requirement in torch-related tests
[ https://issues.apache.org/jira/browse/SPARK-43231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-43231: - Assignee: Ruifeng Zheng > Reduce the memory requirement in torch-related tests > > > Key: SPARK-43231 > URL: https://issues.apache.org/jira/browse/SPARK-43231 > Project: Spark > Issue Type: Test > Components: Connect, ML, PySpark, Tests >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor
[jira] [Commented] (SPARK-43273) Spark can't read parquet files with a newer LZ4_RAW compression
[ https://issues.apache.org/jira/browse/SPARK-43273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17716135#comment-17716135 ] Andrew Grigorev commented on SPARK-43273: - Just as icing on the cake, ClickHouse accidentally started to use LZ4_RAW by default for their Parquet output format :). > Spark can't read parquet files with a newer LZ4_RAW compression > --- > > Key: SPARK-43273 > URL: https://issues.apache.org/jira/browse/SPARK-43273 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.4, 3.3.3, 3.3.2, 3.4.0 >Reporter: Andrew Grigorev >Priority: Trivial > > The parquet-hadoop version should be updated to 1.13.0 > > {code:java} > java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job > aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent > failure: Lost task 2.0 in stage 1.0 (TID 3) (f2b63fdfa0a6 executor driver): > java.lang.IllegalArgumentException: No enum constant > org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW > at java.base/java.lang.Enum.valueOf(Enum.java:273) > at > org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26) > at > org.apache.parquet.format.converter.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:636) > ... {code}
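The stack trace in this issue shows the reader failing inside `Enum.valueOf` because its `CompressionCodecName` enum predates LZ4_RAW. A minimal Python analogue of that failure mode (the enum members below are an illustrative subset, not parquet-mr's real definition):

```python
# Minimal analogue of the crash above: an older reader's codec enum has no
# LZ4_RAW member, so resolving the codec name from file metadata fails.
# The members listed are an illustrative subset, not parquet-mr's real enum.
from enum import Enum

class CompressionCodecName(Enum):
    UNCOMPRESSED = 0
    SNAPPY = 1
    GZIP = 2
    LZ4 = 3

def codec_from_metadata(name):
    try:
        return CompressionCodecName[name]  # name-based lookup, like Enum.valueOf
    except KeyError:
        # Mirrors Java's IllegalArgumentException from Enum.valueOf
        raise ValueError(f"No enum constant CompressionCodecName.{name}") from None

print(codec_from_metadata("SNAPPY"))
```

A file written with the newer LZ4_RAW codec hits the `except` branch; nothing the reader can do at runtime recovers it, which is why the fix is upgrading the parquet-mr dependency so the enum gains the new member.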
[jira] [Updated] (SPARK-43273) Spark can't read parquet files with a newer LZ4_RAW compression
[ https://issues.apache.org/jira/browse/SPARK-43273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Grigorev updated SPARK-43273: Description: The parquet-hadoop version should be updated to 1.13.0 {code:java} java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent failure: Lost task 2.0 in stage 1.0 (TID 3) (f2b63fdfa0a6 executor driver): java.lang.IllegalArgumentException: No enum constant org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW at java.base/java.lang.Enum.valueOf(Enum.java:273) at org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26) at org.apache.parquet.format.converter.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:636) ... {code} was: The parquet-hadoop version should be updated to 1.13.0 > Spark can't read parquet files with a newer LZ4_RAW compression > --- > > Key: SPARK-43273 > URL: https://issues.apache.org/jira/browse/SPARK-43273 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.4, 3.3.3, 3.3.2, 3.4.0 >Reporter: Andrew Grigorev >Priority: Trivial > > The parquet-hadoop version should be updated to 1.13.0 > > {code:java} > java.util.concurrent.ExecutionException: org.apache.spark.SparkException: Job > aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent > failure: Lost task 2.0 in stage 1.0 (TID 3) (f2b63fdfa0a6 executor driver): > java.lang.IllegalArgumentException: No enum constant > org.apache.parquet.hadoop.metadata.CompressionCodecName.LZ4_RAW > at java.base/java.lang.Enum.valueOf(Enum.java:273) > at > org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26) > at > org.apache.parquet.format.converter.ParquetMetadataConverter.fromFormatCodec(ParquetMetadataConverter.java:636) > ...
{code}
[jira] [Created] (SPARK-43273) Spark can't read parquet files with a newer LZ4_RAW compression
Andrew Grigorev created SPARK-43273: --- Summary: Spark can't read parquet files with a newer LZ4_RAW compression Key: SPARK-43273 URL: https://issues.apache.org/jira/browse/SPARK-43273 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.4.0, 3.3.2, 3.2.4, 3.3.3 Reporter: Andrew Grigorev The parquet-hadoop version should be updated to 1.13.0