[jira] [Created] (SPARK-43211) Remove Hadoop2 support in IsolatedClientLoader
Cheng Pan created SPARK-43211:

Summary: Remove Hadoop2 support in IsolatedClientLoader
Key: SPARK-43211
URL: https://issues.apache.org/jira/browse/SPARK-43211
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.5.0
Reporter: Cheng Pan
[jira] [Created] (SPARK-43210) Introduce PySparkAssertionError
Haejoon Lee created SPARK-43210:

Summary: Introduce PySparkAssertionError
Key: SPARK-43210
URL: https://issues.apache.org/jira/browse/SPARK-43210
Project: Spark
Issue Type: Sub-task
Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Haejoon Lee

Introduce PySparkAssertionError.
[jira] [Created] (SPARK-43209) Migrate Expression errors into error class
Haejoon Lee created SPARK-43209:

Summary: Migrate Expression errors into error class
Key: SPARK-43209
URL: https://issues.apache.org/jira/browse/SPARK-43209
Project: Spark
Issue Type: Sub-task
Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Haejoon Lee

Migrate Expression errors into error class.
[jira] [Commented] (SPARK-42945) Support PYSPARK_JVM_STACKTRACE_ENABLED in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-42945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714416#comment-17714416 ]

Hyukjin Kwon commented on SPARK-42945:

Reverted at https://github.com/apache/spark/commit/09a43531d30346bb7c8d213822513dc35c70f82e

> Support PYSPARK_JVM_STACKTRACE_ENABLED in Spark Connect
> Key: SPARK-42945
> URL: https://issues.apache.org/jira/browse/SPARK-42945
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.5.0
> Reporter: Allison Wang
> Assignee: Allison Wang
> Priority: Major
>
> Make the PySpark setting PYSPARK_JVM_STACKTRACE_ENABLED work with Spark Connect.
[jira] [Reopened] (SPARK-42945) Support PYSPARK_JVM_STACKTRACE_ENABLED in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-42945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reopened SPARK-42945:

> Support PYSPARK_JVM_STACKTRACE_ENABLED in Spark Connect
> Key: SPARK-42945
> URL: https://issues.apache.org/jira/browse/SPARK-42945
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.5.0
> Reporter: Allison Wang
> Assignee: Allison Wang
> Priority: Major
> Fix For: 3.5.0
>
> Make the PySpark setting PYSPARK_JVM_STACKTRACE_ENABLED work with Spark Connect.
[jira] [Updated] (SPARK-42945) Support PYSPARK_JVM_STACKTRACE_ENABLED in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-42945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-42945:
Fix Version/s: (was: 3.5.0)

> Support PYSPARK_JVM_STACKTRACE_ENABLED in Spark Connect
> Key: SPARK-42945
> URL: https://issues.apache.org/jira/browse/SPARK-42945
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.5.0
> Reporter: Allison Wang
> Assignee: Allison Wang
> Priority: Major
>
> Make the PySpark setting PYSPARK_JVM_STACKTRACE_ENABLED work with Spark Connect.
[jira] [Created] (SPARK-43208) IsolatedClassLoader should close barrier class InputStream after reading
Cheng Pan created SPARK-43208:

Summary: IsolatedClassLoader should close barrier class InputStream after reading
Key: SPARK-43208
URL: https://issues.apache.org/jira/browse/SPARK-43208
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.3.0
Reporter: Cheng Pan
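SPARK-43208 has no description, but the title points at a stream left open when the isolated client loader reads the bytes of a barrier class. A minimal sketch of the kind of fix the title suggests, assuming the bytes come from getResourceAsStream and commons-io as in Spark's IsolatedClientLoader; this is illustrative, not the actual patch:

{code:scala}
import java.io.InputStream
import org.apache.commons.io.IOUtils

// Read the whole .class resource, then close the stream instead of
// leaving it for the garbage collector (the leak the title describes).
def readClassBytes(in: InputStream): Array[Byte] = {
  try {
    IOUtils.toByteArray(in)
  } finally {
    in.close()
  }
}
{code}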
[jira] [Assigned] (SPARK-43196) Replace reflection w/ direct calling for `ContainerLaunchContext#setTokensConf`
[ https://issues.apache.org/jira/browse/SPARK-43196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun reassigned SPARK-43196:
Assignee: Yang Jie

> Replace reflection w/ direct calling for `ContainerLaunchContext#setTokensConf`
> Key: SPARK-43196
> URL: https://issues.apache.org/jira/browse/SPARK-43196
> Project: Spark
> Issue Type: Sub-task
> Components: YARN
> Affects Versions: 3.5.0
> Reporter: Yang Jie
> Assignee: Yang Jie
> Priority: Minor
[jira] [Resolved] (SPARK-43196) Replace reflection w/ direct calling for `ContainerLaunchContext#setTokensConf`
[ https://issues.apache.org/jira/browse/SPARK-43196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-43196.
Fix Version/s: 3.5.0
Resolution: Fixed

Issue resolved by pull request 40855: https://github.com/apache/spark/pull/40855

> Replace reflection w/ direct calling for `ContainerLaunchContext#setTokensConf`
> Key: SPARK-43196
> URL: https://issues.apache.org/jira/browse/SPARK-43196
> Project: Spark
> Issue Type: Sub-task
> Components: YARN
> Affects Versions: 3.5.0
> Reporter: Yang Jie
> Assignee: Yang Jie
> Priority: Minor
> Fix For: 3.5.0
[jira] [Resolved] (SPARK-43191) Replace reflection w/ direct calling for Hadoop CallerContext
[ https://issues.apache.org/jira/browse/SPARK-43191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-43191.
Fix Version/s: 3.5.0
Resolution: Fixed

Issue resolved by pull request 40850: https://github.com/apache/spark/pull/40850

> Replace reflection w/ direct calling for Hadoop CallerContext
> Key: SPARK-43191
> URL: https://issues.apache.org/jira/browse/SPARK-43191
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 3.5.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
> Fix For: 3.5.0
[jira] [Assigned] (SPARK-43191) Replace reflection w/ direct calling for Hadoop CallerContext
[ https://issues.apache.org/jira/browse/SPARK-43191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun reassigned SPARK-43191:
Assignee: Cheng Pan

> Replace reflection w/ direct calling for Hadoop CallerContext
> Key: SPARK-43191
> URL: https://issues.apache.org/jira/browse/SPARK-43191
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 3.5.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
[jira] [Resolved] (SPARK-43200) Remove Hadoop 2 reference in docs
[ https://issues.apache.org/jira/browse/SPARK-43200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-43200.
Fix Version/s: 3.5.0
Resolution: Fixed

Issue resolved by pull request 40857: https://github.com/apache/spark/pull/40857

> Remove Hadoop 2 reference in docs
> Key: SPARK-43200
> URL: https://issues.apache.org/jira/browse/SPARK-43200
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation
> Affects Versions: 3.5.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
> Fix For: 3.5.0
[jira] [Assigned] (SPARK-43200) Remove Hadoop 2 reference in docs
[ https://issues.apache.org/jira/browse/SPARK-43200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun reassigned SPARK-43200:
Assignee: Cheng Pan

> Remove Hadoop 2 reference in docs
> Key: SPARK-43200
> URL: https://issues.apache.org/jira/browse/SPARK-43200
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation
> Affects Versions: 3.5.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
[jira] [Commented] (SPARK-43197) Clean up the code written for compatibility with Hadoop 2
[ https://issues.apache.org/jira/browse/SPARK-43197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714403#comment-17714403 ]

Sun Chao commented on SPARK-43197:

Thanks for the ping [~gurwls223]. Subscribed.

> Clean up the code written for compatibility with Hadoop 2
> Key: SPARK-43197
> URL: https://issues.apache.org/jira/browse/SPARK-43197
> Project: Spark
> Issue Type: Umbrella
> Components: Spark Core, SQL, YARN
> Affects Versions: 3.5.0
> Reporter: Yang Jie
> Priority: Major
>
> SPARK-42452 removed support for Hadoop 2; we can clean up the code written for compatibility with Hadoop 2 to make it more concise.
[jira] [Resolved] (SPARK-43195) Remove unnecessary serializable wrapper in HadoopFSUtils
[ https://issues.apache.org/jira/browse/SPARK-43195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-43195.
Fix Version/s: 3.5.0
Resolution: Fixed

Issue resolved by pull request 40854: https://github.com/apache/spark/pull/40854

> Remove unnecessary serializable wrapper in HadoopFSUtils
> Key: SPARK-43195
> URL: https://issues.apache.org/jira/browse/SPARK-43195
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 3.5.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
> Fix For: 3.5.0
[jira] [Assigned] (SPARK-43195) Remove unnecessary serializable wrapper in HadoopFSUtils
[ https://issues.apache.org/jira/browse/SPARK-43195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun reassigned SPARK-43195:
Assignee: Cheng Pan

> Remove unnecessary serializable wrapper in HadoopFSUtils
> Key: SPARK-43195
> URL: https://issues.apache.org/jira/browse/SPARK-43195
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 3.5.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
[jira] [Resolved] (SPARK-43112) Spark may use a column other than the actual specified partitioning column for partitioning, for Hive format tables
[ https://issues.apache.org/jira/browse/SPARK-43112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Asif resolved SPARK-43112.
Resolution: Not A Bug

> Spark may use a column other than the actual specified partitioning column for partitioning, for Hive format tables
> Key: SPARK-43112
> URL: https://issues.apache.org/jira/browse/SPARK-43112
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.1
> Reporter: Asif
> Priority: Major
>
> The class org.apache.spark.sql.catalyst.catalog.HiveTableRelation has its output method implemented as:
>
> // The partition column should always appear after data columns.
> override def output: Seq[AttributeReference] = dataCols ++ partitionCols
>
> But the data-writing commands of Spark, like InsertIntoHiveDirCommand, expect the output from HiveTableRelation to be in the order in which the columns are actually defined in the DDL.
> As a result, multiple mismatch scenarios can happen:
> 1) A data-type casting exception is thrown, even though the DataFrame being inserted has a schema identical to the one used for creating the DDL.
> 2) The wrong column is used for partitioning if the data types are the same or castable, like date and long.
> A PR with a test reproducing the bug will follow.
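A sketch of the reported mismatch, reconstructed from the description above. It assumes a Hive-enabled session and that the unified CREATE TABLE syntax accepts a partition column declared mid-schema for the hive provider; the table and column names are illustrative, not taken from the report:

{code:scala}
// The partition column `dt` sits in the middle of the DDL, but
// HiveTableRelation.output moves it to the end (dataCols ++ partitionCols).
spark.sql("""
  CREATE TABLE t (id INT, dt DATE, value BIGINT)
  USING hive
  PARTITIONED BY (dt)
""")

// Prints id, value, dt -- not the DDL order (id, dt, value). A writer that
// assumes DDL order can bind `dt` to `value`: a cast error, or silent
// mis-partitioning when the types happen to be castable.
spark.table("t").printSchema()
{code}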
[jira] [Updated] (SPARK-43206) Streaming query exception() also include stack trace
[ https://issues.apache.org/jira/browse/SPARK-43206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Liu updated SPARK-43206:
Description: [https://github.com/apache/spark/pull/40785#issuecomment-1515522281]

> Streaming query exception() also include stack trace
> Key: SPARK-43206
> URL: https://issues.apache.org/jira/browse/SPARK-43206
> Project: Spark
> Issue Type: Task
> Components: Connect, Structured Streaming
> Affects Versions: 3.5.0
> Reporter: Wei Liu
> Priority: Major
>
> [https://github.com/apache/spark/pull/40785#issuecomment-1515522281]
[jira] [Updated] (SPARK-43206) Streaming query exception() also include stack trace
[ https://issues.apache.org/jira/browse/SPARK-43206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Liu updated SPARK-43206:
Epic Link: SPARK-42938

> Streaming query exception() also include stack trace
> Key: SPARK-43206
> URL: https://issues.apache.org/jira/browse/SPARK-43206
> Project: Spark
> Issue Type: Task
> Components: Connect, Structured Streaming
> Affects Versions: 3.5.0
> Reporter: Wei Liu
> Priority: Major
>
> [https://github.com/apache/spark/pull/40785#issuecomment-1515522281]
[jira] [Resolved] (SPARK-43167) Streaming Connect console output format support
[ https://issues.apache.org/jira/browse/SPARK-43167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Liu resolved SPARK-43167.
Resolution: Not A Problem

Automatically supported with the existing Connect implementation.

> Streaming Connect console output format support
> Key: SPARK-43167
> URL: https://issues.apache.org/jira/browse/SPARK-43167
> Project: Spark
> Issue Type: Task
> Components: Connect, Structured Streaming
> Affects Versions: 3.5.0
> Reporter: Wei Liu
> Priority: Major
[jira] [Updated] (SPARK-43206) Streaming query exception() also include stack trace
[ https://issues.apache.org/jira/browse/SPARK-43206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei Liu updated SPARK-43206:
Environment: (was: https://github.com/apache/spark/pull/40785#issuecomment-1515522281)

> Streaming query exception() also include stack trace
> Key: SPARK-43206
> URL: https://issues.apache.org/jira/browse/SPARK-43206
> Project: Spark
> Issue Type: Task
> Components: Connect, Structured Streaming
> Affects Versions: 3.5.0
> Reporter: Wei Liu
> Priority: Major
[jira] [Updated] (SPARK-43194) PySpark 3.4.0 cannot convert timestamp-typed objects to pandas with pandas 2.0
[ https://issues.apache.org/jira/browse/SPARK-43194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-43194:
Parent: SPARK-42618
Issue Type: Sub-task (was: Bug)

> PySpark 3.4.0 cannot convert timestamp-typed objects to pandas with pandas 2.0
> Key: SPARK-43194
> URL: https://issues.apache.org/jira/browse/SPARK-43194
> Project: Spark
> Issue Type: Sub-task
> Components: PySpark
> Affects Versions: 3.4.0
> Environment:
> {code}
> In [4]: import pandas as pd
> In [5]: pd.__version__
> Out[5]: '2.0.0'
> In [6]: import pyspark as ps
> In [7]: ps.__version__
> Out[7]: '3.4.0'
> {code}
> Reporter: Phillip Cloud
> Priority: Major
>
> {code}
> In [1]: from pyspark.sql import SparkSession
> In [2]: session = SparkSession.builder.appName("test").getOrCreate()
> 23/04/19 09:21:42 WARN Utils: Your hostname, albatross resolves to a loopback address: 127.0.0.2; using 192.168.1.170 instead (on interface enp5s0)
> 23/04/19 09:21:42 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> 23/04/19 09:21:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> In [3]: session.sql("select now()").toPandas()
> {code}
> Results in:
> {code}
> ...
> TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass e.g. 'datetime64[ns]' instead.
> {code}
[jira] [Commented] (SPARK-43189) No overload variant of "pandas_udf" matches argument type "str"
[ https://issues.apache.org/jira/browse/SPARK-43189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714351#comment-17714351 ]

Hyukjin Kwon commented on SPARK-43189:

[~ei-grad] are you interested in submitting a PR?

> No overload variant of "pandas_udf" matches argument type "str"
> Key: SPARK-43189
> URL: https://issues.apache.org/jira/browse/SPARK-43189
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.2.4, 3.3.2, 3.4.0
> Reporter: Andrew Grigorev
> Priority: Major
>
> h2. Issue
> Users who have mypy enabled in their IDE or CI environment face very verbose error messages when using the {{pandas_udf}} function in PySpark. The current typing of the {{pandas_udf}} function seems to be causing these issues. As a workaround, the official documentation provides examples that use {{# type: ignore[call-overload]}}, but this is not an ideal solution.
> h2. Example
> Here's a code snippet taken from the [docs|https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-udfs-a-k-a-vectorized-udfs] that triggers the error when mypy is enabled:
> {code:python}
> from pyspark.sql.functions import pandas_udf
> import pandas as pd
>
> @pandas_udf("col1 string, col2 long")
> def func(s1: pd.Series, s2: pd.Series, s3: pd.DataFrame) -> pd.DataFrame:
>     s3['col2'] = s1 + s2.str.len()
>     return s3 {code}
> Running mypy on this code results in a long and verbose error message, which makes it difficult for users to understand the actual issue and how to resolve it.
> h2. Proposed Solution
> We kindly request the PySpark development team to review and improve the typing for the {{pandas_udf}} function to prevent these verbose error messages from appearing. This improvement will help users who have mypy enabled in their development environments to have a better experience when using PySpark.
> Furthermore, we suggest updating the official documentation to provide better examples that do not rely on {{# type: ignore[call-overload]}} to suppress these errors.
> h2. Impact
> By addressing this issue, users of PySpark with mypy enabled in their development environment will be able to write and verify their code more efficiently, without being overwhelmed by verbose error messages. This will lead to a more enjoyable and productive experience when working with PySpark and pandas UDFs.
[jira] [Commented] (SPARK-43197) Clean up the code written for compatibility with Hadoop 2
[ https://issues.apache.org/jira/browse/SPARK-43197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714350#comment-17714350 ]

Hyukjin Kwon commented on SPARK-43197:

cc [~sunchao] FYI

> Clean up the code written for compatibility with Hadoop 2
> Key: SPARK-43197
> URL: https://issues.apache.org/jira/browse/SPARK-43197
> Project: Spark
> Issue Type: Umbrella
> Components: Spark Core, SQL, YARN
> Affects Versions: 3.5.0
> Reporter: Yang Jie
> Priority: Major
>
> SPARK-42452 removed support for Hadoop 2; we can clean up the code written for compatibility with Hadoop 2 to make it more concise.
[jira] [Updated] (SPARK-43201) Inconsistency between from_avro and from_json function
[ https://issues.apache.org/jira/browse/SPARK-43201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-43201:
Component/s: SQL (was: Spark Core)

> Inconsistency between from_avro and from_json function
> Key: SPARK-43201
> URL: https://issues.apache.org/jira/browse/SPARK-43201
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Philip Adetiloye
> Priority: Major
>
> Spark's from_avro function does not allow the schema to come from a DataFrame column; it only takes a String schema:
> {code:java}
> def from_avro(col: Column, jsonFormatSchema: String): Column {code}
> This makes it impossible to deserialize rows of Avro records with different schemas, since only one schema string can be passed externally.
>
> Here is what I would expect:
> {code:java}
> def from_avro(col: Column, jsonFormatSchema: Column): Column {code}
> code example:
> {code:java}
> import org.apache.spark.sql.functions.from_avro
>
> val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
>
> val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""
>
> val df = Seq(
>   (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
>   (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
> ).toDF("binaryData", "schema")
>
> val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))
> parsed.show()
>
> // Output:
> // +-------------+
> // |   parsedData|
> // +-------------+
> // |[apple1, 1.0]|
> // |[apple2, 2.0]|
> // +-------------+
> {code}
[jira] [Created] (SPARK-43207) Add helper functions for extracting values from literal expressions
Ruifeng Zheng created SPARK-43207:

Summary: Add helper functions for extracting values from literal expressions
Key: SPARK-43207
URL: https://issues.apache.org/jira/browse/SPARK-43207
Project: Spark
Issue Type: Improvement
Components: Connect
Affects Versions: 3.5.0
Reporter: Ruifeng Zheng
[jira] [Created] (SPARK-43206) Streaming query exception() also include stack trace
Wei Liu created SPARK-43206:

Summary: Streaming query exception() also include stack trace
Key: SPARK-43206
URL: https://issues.apache.org/jira/browse/SPARK-43206
Project: Spark
Issue Type: Task
Components: Connect, Structured Streaming
Affects Versions: 3.5.0
Environment: https://github.com/apache/spark/pull/40785#issuecomment-1515522281
Reporter: Wei Liu
[jira] [Resolved] (SPARK-43129) Scala Core API for Streaming Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-43129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-43129.
Fix Version/s: 3.5.0
Resolution: Fixed

Issue resolved by pull request 40783: https://github.com/apache/spark/pull/40783

> Scala Core API for Streaming Spark Connect
> Key: SPARK-43129
> URL: https://issues.apache.org/jira/browse/SPARK-43129
> Project: Spark
> Issue Type: Task
> Components: Connect, Structured Streaming
> Affects Versions: 3.5.0
> Reporter: Raghu Angadi
> Assignee: Raghu Angadi
> Priority: Major
> Fix For: 3.5.0
>
> Scala client API for streaming Spark Connect.
[jira] [Assigned] (SPARK-43129) Scala Core API for Streaming Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-43129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-43129:
Assignee: Raghu Angadi

> Scala Core API for Streaming Spark Connect
> Key: SPARK-43129
> URL: https://issues.apache.org/jira/browse/SPARK-43129
> Project: Spark
> Issue Type: Task
> Components: Connect, Structured Streaming
> Affects Versions: 3.5.0
> Reporter: Raghu Angadi
> Assignee: Raghu Angadi
> Priority: Major
>
> Scala client API for streaming Spark Connect.
[jira] [Assigned] (SPARK-42945) Support PYSPARK_JVM_STACKTRACE_ENABLED in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-42945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-42945:
Assignee: Allison Wang

> Support PYSPARK_JVM_STACKTRACE_ENABLED in Spark Connect
> Key: SPARK-42945
> URL: https://issues.apache.org/jira/browse/SPARK-42945
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.5.0
> Reporter: Allison Wang
> Assignee: Allison Wang
> Priority: Major
>
> Make the PySpark setting PYSPARK_JVM_STACKTRACE_ENABLED work with Spark Connect.
[jira] [Resolved] (SPARK-42945) Support PYSPARK_JVM_STACKTRACE_ENABLED in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-42945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-42945.
Fix Version/s: 3.5.0
Resolution: Fixed

Issue resolved by pull request 40575: https://github.com/apache/spark/pull/40575

> Support PYSPARK_JVM_STACKTRACE_ENABLED in Spark Connect
> Key: SPARK-42945
> URL: https://issues.apache.org/jira/browse/SPARK-42945
> Project: Spark
> Issue Type: Improvement
> Components: Connect
> Affects Versions: 3.5.0
> Reporter: Allison Wang
> Assignee: Allison Wang
> Priority: Major
> Fix For: 3.5.0
>
> Make the PySpark setting PYSPARK_JVM_STACKTRACE_ENABLED work with Spark Connect.
[jira] [Created] (SPARK-43205) Add an IDENTIFIER(stringLiteral) clause that maps a string to an identifier
Serge Rielau created SPARK-43205:

Summary: Add an IDENTIFIER(stringLiteral) clause that maps a string to an identifier
Key: SPARK-43205
URL: https://issues.apache.org/jira/browse/SPARK-43205
Project: Spark
Issue Type: New Feature
Components: Spark Core
Affects Versions: 3.5.0
Reporter: Serge Rielau

There is a requirement for SQL templates, where the table and/or column names are provided through substitution. This can be done today using variable substitution:

SET hivevar:tabname = mytab;
SELECT * FROM ${hivevar:tabname};

A straight variable substitution is dangerous since it allows for SQL injection:

SET hivevar:tabname = mytab, someothertab;
SELECT * FROM ${hivevar:tabname};

A way to get around this problem is to wrap the variable substitution with a clause that limits its scope to producing an identifier. This approach is taken by Snowflake:
https://docs.snowflake.com/en/sql-reference/session-variables#using-variables-in-sql

SET hivevar:tabname = 'tabname';
SELECT * FROM IDENTIFIER(${hivevar:tabname});
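A short sketch of the contrast above, expressed as spark.sql calls. The IDENTIFIER() clause is the proposed behavior, not an API that existed at the time of filing, and the table names are illustrative:

{code:scala}
// Plain substitution splices raw text into the query, so a crafted value
// changes the query's shape -- the injection the description warns about:
val unsafe = "mytab, someothertab"
spark.sql(s"SELECT * FROM $unsafe") // now reads from two tables

// Under the proposal, the substituted string must resolve to a single
// identifier, so the value above would be rejected instead of spliced:
val name = "mytab"
spark.sql(s"SELECT * FROM IDENTIFIER('$name')")
{code}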
[jira] [Created] (SPARK-43204) Align MERGE assignments with table attributes
Anton Okolnychyi created SPARK-43204:

Summary: Align MERGE assignments with table attributes
Key: SPARK-43204
URL: https://issues.apache.org/jira/browse/SPARK-43204
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 3.5.0
Reporter: Anton Okolnychyi

Similar to SPARK-42151, we need to do the same for MERGE assignments.
[jira] [Created] (SPARK-43202) Replace reflection w/ direct calling for YARN Resource API
Cheng Pan created SPARK-43202:

Summary: Replace reflection w/ direct calling for YARN Resource API
Key: SPARK-43202
URL: https://issues.apache.org/jira/browse/SPARK-43202
Project: Spark
Issue Type: Sub-task
Components: YARN
Affects Versions: 3.5.0
Reporter: Cheng Pan
[jira] [Created] (SPARK-43203) Fix DROP table behavior in session catalog
Anton Okolnychyi created SPARK-43203:

Summary: Fix DROP table behavior in session catalog
Key: SPARK-43203
URL: https://issues.apache.org/jira/browse/SPARK-43203
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.4.0
Reporter: Anton Okolnychyi

DROP table behavior is not working correctly in 3.4.0 because we always invoke V1 drop logic if the identifier looks like a V1 identifier. This is a big blocker for external data sources that provide custom session catalogs. See [here|https://github.com/apache/spark/pull/37879/files#r1170501180] for details.
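A sketch of the setup the report concerns. The catalog implementation class below is a placeholder for an external data source's custom session catalog, not a real class; only the spark.sql.catalog.spark_catalog key itself is an actual Spark configuration:

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical: plug a custom session catalog in as the V2 session catalog.
val spark = SparkSession.builder()
  .config("spark.sql.catalog.spark_catalog", "com.example.MySessionCatalog")
  .getOrCreate()

// Per the report, a V1-looking identifier makes Spark 3.4.0 route this to
// the built-in V1 drop logic, bypassing the custom catalog configured above.
spark.sql("DROP TABLE default.t")
{code}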
[jira] [Updated] (SPARK-43203) Fix DROP table behavior in session catalog
[ https://issues.apache.org/jira/browse/SPARK-43203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Okolnychyi updated SPARK-43203:
Description:
DROP table behavior is not working correctly in 3.4.0 because we always invoke V1 drop logic if the identifier looks like a V1 identifier. This is a big blocker for external data sources that provide custom session catalogs. See [here|https://github.com/apache/spark/pull/37879/files#r1170501180] for details.

was:
DROP table behavior is not working correctly in 3.4.0 because we always invoke V1 drop logic if the identifier looks like a V1 identifier. This is a big blockers for external data sources that provide custom session catalogs. See [here|https://github.com/apache/spark/pull/37879/files#r1170501180] for details.

> Fix DROP table behavior in session catalog
> Key: SPARK-43203
> URL: https://issues.apache.org/jira/browse/SPARK-43203
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Anton Okolnychyi
> Priority: Major
>
> DROP table behavior is not working correctly in 3.4.0 because we always invoke V1 drop logic if the identifier looks like a V1 identifier. This is a big blocker for external data sources that provide custom session catalogs. See [here|https://github.com/apache/spark/pull/37879/files#r1170501180] for details.
[jira] [Updated] (SPARK-43201) Inconsistency between from_avro and from_json function
[ https://issues.apache.org/jira/browse/SPARK-43201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip Adetiloye updated SPARK-43201:
Description:
Spark's from_avro function does not allow the schema to come from a DataFrame column; it only takes a String schema:
{code:java}
def from_avro(col: Column, jsonFormatSchema: String): Column {code}
This makes it impossible to deserialize rows of Avro records with different schemas, since only one schema string can be passed externally.

Here is what I would expect:
{code:java}
def from_avro(col: Column, jsonFormatSchema: Column): Column {code}
code example:
{code:java}
import org.apache.spark.sql.functions.from_avro

val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""

val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""

val df = Seq(
  (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
  (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
).toDF("binaryData", "schema")

val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))parsed.show()

// Output:
// +-------------+
// |   parsedData|
// +-------------+
// |[apple1, 1.0]|
// |[apple2, 2.0]|
// +-------------+
{code}

was:
Spark's from_avro function does not allow the schema to come from a DataFrame column; it only takes a String schema:
{code:java}
def from_avro(col: Column, jsonFormatSchema: String): Column {code}
This makes it impossible to deserialize rows of Avro records with different schemas, since only one schema string can be passed externally.

Here is what I would expect:
{code:java}
def from_avro(col: Column, jsonFormatSchema: Column): Column {code}
code example:
{code:java}
import org.apache.spark.sql.functions.from_avro

val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""val val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""

val df = Seq(
  (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
  (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
).toDF("binaryData", "schema")

val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))parsed.show()

// Output:
// +-------------+
// |   parsedData|
// +-------------+
// |[apple1, 1.0]|
// |[apple2, 2.0]|
// +-------------+
{code}

> Inconsistency between from_avro and from_json function
> Key: SPARK-43201
> URL: https://issues.apache.org/jira/browse/SPARK-43201
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: Philip Adetiloye
> Priority: Major
[jira] [Updated] (SPARK-43201) Inconsistency between from_avro and from_json function
[ https://issues.apache.org/jira/browse/SPARK-43201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip Adetiloye updated SPARK-43201:
Description:
Spark's from_avro function does not allow the schema to come from a DataFrame column; it only takes a String schema:
{code:java}
def from_avro(col: Column, jsonFormatSchema: String): Column {code}
This makes it impossible to deserialize rows of Avro records with different schemas, since only one schema string can be passed externally.

Here is what I would expect:
{code:java}
def from_avro(col: Column, jsonFormatSchema: Column): Column {code}
code example:
{code:java}
import org.apache.spark.sql.functions.from_avro

val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""

val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""

val df = Seq(
  (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
  (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
).toDF("binaryData", "schema")

val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))
parsed.show()

// Output:
// +-------------+
// |   parsedData|
// +-------------+
// |[apple1, 1.0]|
// |[apple2, 2.0]|
// +-------------+
{code}

was:
Spark's from_avro function does not allow the schema to come from a DataFrame column; it only takes a String schema:
{code:java}
def from_avro(col: Column, jsonFormatSchema: String): Column {code}
This makes it impossible to deserialize rows of Avro records with different schemas, since only one schema string can be passed externally.

Here is what I would expect:
{code:java}
def from_avro(col: Column, jsonFormatSchema: Column): Column {code}
code example:
{code:java}
import org.apache.spark.sql.functions.from_avro

val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""

val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""

val df = Seq(
  (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
  (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
).toDF("binaryData", "schema")

val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))parsed.show()

// Output:
// +-------------+
// |   parsedData|
// +-------------+
// |[apple1, 1.0]|
// |[apple2, 2.0]|
// +-------------+
{code}

> Inconsistency between from_avro and from_json function
> Key: SPARK-43201
> URL: https://issues.apache.org/jira/browse/SPARK-43201
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.4.0
> Reporter: Philip Adetiloye
> Priority: Major
[jira] [Created] (SPARK-43201) Inconsistency between from_avro and from_json function
Philip Adetiloye created SPARK-43201:

Summary: Inconsistency between from_avro and from_json function
Key: SPARK-43201
URL: https://issues.apache.org/jira/browse/SPARK-43201
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.4.0
Reporter: Philip Adetiloye

Spark's from_avro function does not allow the schema to come from a DataFrame column; it only takes a String schema:
{code:java}
def from_avro(col: Column, jsonFormatSchema: String): Column {code}
This makes it impossible to deserialize rows of Avro records with different schemas, since only one schema string can be passed externally.

Here is what I would expect:
{code:java}
def from_avro(col: Column, jsonFormatSchema: Column): Column {code}
code example:
{code:java}
import org.apache.spark.sql.functions.from_avro

val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""val val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""

val df = Seq(
  (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
  (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
).toDF("binaryData", "schema")

val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))parsed.show()

// Output:
// +-------------+
// |   parsedData|
// +-------------+
// |[apple1, 1.0]|
// |[apple2, 2.0]|
// +-------------+
{code}
[jira] [Created] (SPARK-43200) Remove Hadoop 2 reference in docs
Cheng Pan created SPARK-43200:

Summary: Remove Hadoop 2 reference in docs
Key: SPARK-43200
URL: https://issues.apache.org/jira/browse/SPARK-43200
Project: Spark
Issue Type: Sub-task
Components: Documentation
Affects Versions: 3.5.0
Reporter: Cheng Pan
[jira] [Resolved] (SPARK-43187) Remove workaround for MiniKdc's BindException
[ https://issues.apache.org/jira/browse/SPARK-43187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-43187.
Fix Version/s: 3.5.0
Resolution: Fixed

Issue resolved by pull request 40849: https://github.com/apache/spark/pull/40849

> Remove workaround for MiniKdc's BindException
> Key: SPARK-43187
> URL: https://issues.apache.org/jira/browse/SPARK-43187
> Project: Spark
> Issue Type: Sub-task
> Components: Tests
> Affects Versions: 3.5.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
> Fix For: 3.5.0
[jira] [Assigned] (SPARK-43187) Remove workaround for MiniKdc's BindException
[ https://issues.apache.org/jira/browse/SPARK-43187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun reassigned SPARK-43187:
Assignee: Cheng Pan

> Remove workaround for MiniKdc's BindException
> Key: SPARK-43187
> URL: https://issues.apache.org/jira/browse/SPARK-43187
> Project: Spark
> Issue Type: Sub-task
> Components: Tests
> Affects Versions: 3.5.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
[jira] [Resolved] (SPARK-43186) Remove workaround for FileSinkDesc
[ https://issues.apache.org/jira/browse/SPARK-43186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun resolved SPARK-43186.
Fix Version/s: 3.5.0
Resolution: Fixed

Issue resolved by pull request 40848: https://github.com/apache/spark/pull/40848

> Remove workaround for FileSinkDesc
> Key: SPARK-43186
> URL: https://issues.apache.org/jira/browse/SPARK-43186
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
> Fix For: 3.5.0
[jira] [Assigned] (SPARK-43186) Remove workaround for FileSinkDesc
[ https://issues.apache.org/jira/browse/SPARK-43186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun reassigned SPARK-43186:
Assignee: Cheng Pan

> Remove workaround for FileSinkDesc
> Key: SPARK-43186
> URL: https://issues.apache.org/jira/browse/SPARK-43186
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: Cheng Pan
> Assignee: Cheng Pan
> Priority: Major
[jira] [Commented] (SPARK-43142) DSL expressions fail on attribute with special characters
[ https://issues.apache.org/jira/browse/SPARK-43142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714209#comment-17714209 ]

Hudson commented on SPARK-43142:

User 'rshkv' has created a pull request for this issue:
https://github.com/apache/spark/pull/40794

> DSL expressions fail on attribute with special characters
> Key: SPARK-43142
> URL: https://issues.apache.org/jira/browse/SPARK-43142
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Willi Raschkowski
> Priority: Major
>
> Expressions on implicitly converted attributes fail if the attributes have names containing special characters. They fail even if the attributes are backtick-quoted:
> {code:java}
> scala> import org.apache.spark.sql.catalyst.dsl.expressions._
> import org.apache.spark.sql.catalyst.dsl.expressions._
>
> scala> "`slashed/col`".attr
> res0: org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute = 'slashed/col
>
> scala> "`slashed/col`".attr.asc
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input '/' expecting {<EOF>, '.', '-'}(line 1, pos 7)
>
> == SQL ==
> slashed/col
> -------^^^
> {code}
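A sketch of a workaround on the public Column API, not taken from the report or the linked PR: the public functions parse backtick-quoted names themselves, so sorting by the slashed column works there even while the catalyst DSL path fails.

{code:scala}
import org.apache.spark.sql.functions.col

// Build a column whose name contains a slash, then sort by it via the
// public API instead of the catalyst DSL helpers.
val df = spark.range(1).withColumnRenamed("id", "slashed/col")
df.sort(col("`slashed/col`").asc).show()
{code}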
[jira] [Commented] (SPARK-43124) Dataset.show should not trigger job execution on CommandResults
[ https://issues.apache.org/jira/browse/SPARK-43124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714211#comment-17714211 ]

Hudson commented on SPARK-43124:

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/40779

> Dataset.show should not trigger job execution on CommandResults
> Key: SPARK-43124
> URL: https://issues.apache.org/jira/browse/SPARK-43124
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Peter Toth
> Priority: Major
[jira] [Commented] (SPARK-43137) Improve ArrayInsert if the position is foldable and equals to zero.
[ https://issues.apache.org/jira/browse/SPARK-43137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714214#comment-17714214 ]

Hudson commented on SPARK-43137:

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/40789

> Improve ArrayInsert if the position is foldable and equals to zero.
> Key: SPARK-43137
> URL: https://issues.apache.org/jira/browse/SPARK-43137
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: jiaan.geng
> Assignee: jiaan.geng
> Priority: Major
> Fix For: 3.5.0
>
> We want to make array_prepend reuse the implementation of array_insert, but performance is somewhat worse if the position is foldable and equals zero. The reason is that the generated code always checks whether the position is negative or positive, and the code is too long. Code that is too long can cause JIT compilation to fail.
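For context, a usage sketch of the two functions the description relates (array positions are 1-based); this is illustrative and not taken from the issue, and the codegen remark in the comments paraphrases the description above:

{code:scala}
// array_prepend(arr, e) can be expressed as array_insert(arr, 1, e);
// with a foldable literal position, the generated code can skip the
// runtime negative/positive branch the description mentions.
spark.sql("SELECT array_insert(array(2, 3, 4), 1, 1)").show()  // [1, 2, 3, 4]
spark.sql("SELECT array_prepend(array(2, 3, 4), 1)").show()    // [1, 2, 3, 4]
{code}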
[jira] [Commented] (SPARK-43160) Remove typing.io namespace references as it is being removed
[ https://issues.apache.org/jira/browse/SPARK-43160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714212#comment-17714212 ]

Hudson commented on SPARK-43160:

User 'aimtsou' has created a pull request for this issue:
https://github.com/apache/spark/pull/40819

> Remove typing.io namespace references as it is being removed
> Key: SPARK-43160
> URL: https://issues.apache.org/jira/browse/SPARK-43160
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.2.4, 3.3.2, 3.4.0
> Reporter: Aimilios Tsouvelekakis
> Priority: Minor
>
> Python 3.11 gives a deprecation warning for the following:
> {code}
> /python/3.11.1/lib/python3.11/site-packages/pyspark/broadcast.py:38: DeprecationWarning: typing.io is deprecated, import directly from typing instead. typing.io will be removed in Python 3.12.
>   from typing.io import BinaryIO  # type: ignore[import]{code}
> The only reference comes from:
> {code}
> spark % git grep typing.io
> python/pyspark/broadcast.py:from typing.io import BinaryIO  # type: ignore[import] {code}
> I will fix the import so it does not cause any deprecation problem.
> This is documented in [1|https://bugs.python.org/issue35089], [2|https://docs.python.org/3/library/typing.html#typing.IO]
[jira] [Created] (SPARK-43199) Make InlineCTE idempotent
Peter Toth created SPARK-43199:

Summary: Make InlineCTE idempotent
Key: SPARK-43199
URL: https://issues.apache.org/jira/browse/SPARK-43199
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.4.0
Reporter: Peter Toth
[jira] [Updated] (SPARK-43198) Fix "Could not initialise class ammonite..." error when using filter
[ https://issues.apache.org/jira/browse/SPARK-43198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Venkata Sai Akhil Gudesa updated SPARK-43198:
Description:
When
{code:java}
spark.range(10).filter(n => n % 2 == 0).collectAsList(){code}
is run in the ammonite REPL (Spark Connect), the following error is thrown:
{noformat}
io.grpc.StatusRuntimeException: UNKNOWN: ammonite/repl/ReplBridge$
  io.grpc.Status.asRuntimeException(Status.java:535)
  io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660)
  org.apache.spark.sql.connect.client.SparkResult.org$apache$spark$sql$connect$client$SparkResult$$processResponses(SparkResult.scala:62)
  org.apache.spark.sql.connect.client.SparkResult.length(SparkResult.scala:114)
  org.apache.spark.sql.connect.client.SparkResult.toArray(SparkResult.scala:131)
  org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2687)
  org.apache.spark.sql.Dataset.withResult(Dataset.scala:3088)
  org.apache.spark.sql.Dataset.collect(Dataset.scala:2686)
  org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2700)
  ammonite.$sess.cmd0$.(cmd0.sc:1)
  ammonite.$sess.cmd0$.(cmd0.sc){noformat}

was:
When `spark.range(10).filter(n => n % 2 == 0).collectAsList()` is run in the ammonite REPL (Spark Connect), the following error is thrown:
```
io.grpc.StatusRuntimeException: UNKNOWN: ammonite/repl/ReplBridge$
  io.grpc.Status.asRuntimeException(Status.java:535)
  io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660)
  org.apache.spark.sql.connect.client.SparkResult.org$apache$spark$sql$connect$client$SparkResult$$processResponses(SparkResult.scala:62)
  org.apache.spark.sql.connect.client.SparkResult.length(SparkResult.scala:114)
  org.apache.spark.sql.connect.client.SparkResult.toArray(SparkResult.scala:131)
  org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2687)
  org.apache.spark.sql.Dataset.withResult(Dataset.scala:3088)
  org.apache.spark.sql.Dataset.collect(Dataset.scala:2686)
  org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2700)
  ammonite.$sess.cmd0$.(cmd0.sc:1)
  ammonite.$sess.cmd0$.(cmd0.sc)
```

> Fix "Could not initialise class ammonite..." error when using filter
> Key: SPARK-43198
> URL: https://issues.apache.org/jira/browse/SPARK-43198
> Project: Spark
> Issue Type: Bug
> Components: Connect
> Affects Versions: 3.5.0
> Reporter: Venkata Sai Akhil Gudesa
> Priority: Major
[jira] [Created] (SPARK-43198) Fix "Could not initialise class ammonite..." error when using filter
Venkata Sai Akhil Gudesa created SPARK-43198: Summary: Fix "Could not initialise class ammonite..." error when using filter Key: SPARK-43198 URL: https://issues.apache.org/jira/browse/SPARK-43198 Project: Spark Issue Type: Bug Components: Connect Affects Versions: 3.5.0 Reporter: Venkata Sai Akhil Gudesa When `spark.range(10).filter(n => n % 2 == 0).collectAsList()` is run in the ammonite REPL (Spark Connect), the following error is thrown: ``` io.grpc.StatusRuntimeException: UNKNOWN: ammonite/repl/ReplBridge$ io.grpc.Status.asRuntimeException(Status.java:535) io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) org.apache.spark.sql.connect.client.SparkResult.org$apache$spark$sql$connect$client$SparkResult$$processResponses(SparkResult.scala:62) org.apache.spark.sql.connect.client.SparkResult.length(SparkResult.scala:114) org.apache.spark.sql.connect.client.SparkResult.toArray(SparkResult.scala:131) org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2687) org.apache.spark.sql.Dataset.withResult(Dataset.scala:3088) org.apache.spark.sql.Dataset.collect(Dataset.scala:2686) org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2700) ammonite.$sess.cmd0$.(cmd0.sc:1) ammonite.$sess.cmd0$.(cmd0.sc) ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
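For readers hitting this before a fix lands, a possible workaround sketch (not the project's fix): expressing the predicate as a Column instead of a Scala lambda sends the filter as an unresolved expression rather than a serialized closure, so the Connect server never needs the Ammonite-generated classes. The snippet assumes a Spark Connect session named {{spark}}:
{code:scala}
import org.apache.spark.sql.functions.col

// Lambda version (fails in the Ammonite REPL as reported above):
//   spark.range(10).filter(n => n % 2 == 0).collectAsList()

// Column-expression version: no client-side closure has to be shipped,
// so ammonite.repl.ReplBridge is never required on the server.
spark.range(10).filter(col("id") % 2 === 0).collectAsList()
{code}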
[jira] [Commented] (SPARK-43179) Add option for applications to control saving of metadata in the External Shuffle Service LevelDB
[ https://issues.apache.org/jira/browse/SPARK-43179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714149#comment-17714149 ] Ignite TC Bot commented on SPARK-43179: --- User 'otterc' has created a pull request for this issue: https://github.com/apache/spark/pull/40843 > Add option for applications to control saving of metadata in the External > Shuffle Service LevelDB > - > > Key: SPARK-43179 > URL: https://issues.apache.org/jira/browse/SPARK-43179 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.4.0 >Reporter: Chandni Singh >Priority: Major > > Currently, the External Shuffle Service stores application metadata in > LevelDB. This is necessary to enable the shuffle server to resume serving > shuffle data for an application whose executors registered before the > NodeManager restarts. However, the metadata includes the application secret, > which is stored in LevelDB without encryption. This is a potential security > risk, particularly for applications with high security requirements. While > filesystem access control lists (ACLs) can help protect keys and > certificates, they may not be sufficient for some use cases. In response, we > have decided not to store metadata for these high-security applications in > LevelDB. As a result, these applications may experience more failures in the > event of a node restart, but we believe this trade-off is acceptable given > the increased security risk. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
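As a rough application-side sketch of the opt-out this issue proposes (the config key below is a placeholder guess, not a name confirmed by the issue; check the linked pull request for the actual one):
{code:scala}
import org.apache.spark.SparkConf

// Hypothetical opt-out for high-security applications: skip persisting app
// metadata (including the secret) in the shuffle service's LevelDB, at the
// cost of losing shuffle-state recovery across NodeManager restarts.
val conf = new SparkConf()
  .setAppName("high-security-app")
  .set("spark.yarn.shuffle.server.recovery.disabled", "true") // placeholder key
{code}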
[jira] [Assigned] (SPARK-43137) Improve ArrayInsert if the position is foldable and equals to zero.
[ https://issues.apache.org/jira/browse/SPARK-43137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-43137: --- Assignee: jiaan.geng > Improve ArrayInsert if the position is foldable and equals to zero. > --- > > Key: SPARK-43137 > URL: https://issues.apache.org/jira/browse/SPARK-43137 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > We want to make array_prepend reuse the implementation of array_insert, but > performance is somewhat worse when the position is foldable and equals zero. > The reason is that the generated code always checks whether the position is > negative or positive, which makes it too long, and overly long code causes > JIT compilation to fail. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43137) Improve ArrayInsert if the position is foldable and equals to zero.
[ https://issues.apache.org/jira/browse/SPARK-43137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-43137. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40833 [https://github.com/apache/spark/pull/40833] > Improve ArrayInsert if the position is foldable and equals to zero. > --- > > Key: SPARK-43137 > URL: https://issues.apache.org/jira/browse/SPARK-43137 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.5.0 > > > We want to make array_prepend reuse the implementation of array_insert, but > performance is somewhat worse when the position is foldable and equals zero. > The reason is that the generated code always checks whether the position is > negative or positive, which makes it too long, and overly long code causes > JIT compilation to fail. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
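For context, a minimal sketch of the operator this issue optimizes, assuming a live SparkSession named {{spark}} on Spark 3.4+; the optimization described above applies when the position argument is a literal (foldable), so its sign is known at planning time:
{code:scala}
// array_insert uses 1-based positions; a literal position lets codegen skip
// the runtime negative-vs-positive branch that the issue says bloats the
// generated code and defeats JIT compilation.
val df = spark.sql("SELECT array_insert(array(1, 2, 3), 1, 0) AS arr")
df.collect() // first row holds [0, 1, 2, 3], i.e. the element is prepended,
             // which is exactly what array_prepend wants to reuse
{code}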
[jira] [Assigned] (SPARK-37829) An outer-join using joinWith on DataFrames returns Rows with null fields instead of null values
[ https://issues.apache.org/jira/browse/SPARK-37829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37829: --- Assignee: Jason Xu > An outer-join using joinWith on DataFrames returns Rows with null fields > instead of null values > --- > > Key: SPARK-37829 > URL: https://issues.apache.org/jira/browse/SPARK-37829 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0 >Reporter: Clément de Groc >Assignee: Jason Xu >Priority: Major > Fix For: 3.4.1, 3.5.0 > > > Doing an outer-join using {{joinWith}} on {{{}DataFrame{}}}s used to return > missing values as {{null}} in Spark 2.4.8, but returns them as {{Rows}} with > {{null}} values in Spark 3+. > The issue can be reproduced with [the following > test|https://github.com/cdegroc/spark/commit/79f4d6a1ec6c69b10b72dbc8f92ab6490d5ef5e5] > that succeeds on Spark 2.4.8 but fails starting from Spark 3.0.0. > The problem only arises when working with DataFrames: Datasets of case > classes work as expected as demonstrated by [this other > test|https://github.com/apache/spark/blob/v3.0.0/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala#L1200-L1223]. > I couldn't find an explanation for this change in the Migration guide so I'm > assuming this is a bug. > A {{git bisect}} pointed me to [that > commit|https://github.com/apache/spark/commit/cd92f25be5a221e0d4618925f7bc9dfd3bb8cb59]. > Reverting the commit solves the problem. > A similar solution, but without reverting, is shown > [here|https://github.com/cdegroc/spark/commit/684c675bf070876a475a9b225f6c2f92edce4c8a]. > Happy to help if you think of another approach / can provide some guidance. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37829) An outer-join using joinWith on DataFrames returns Rows with null fields instead of null values
[ https://issues.apache.org/jira/browse/SPARK-37829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37829. - Fix Version/s: 3.5.0 3.4.1 Resolution: Fixed Issue resolved by pull request 40755 [https://github.com/apache/spark/pull/40755] > An outer-join using joinWith on DataFrames returns Rows with null fields > instead of null values > --- > > Key: SPARK-37829 > URL: https://issues.apache.org/jira/browse/SPARK-37829 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0 >Reporter: Clément de Groc >Priority: Major > Fix For: 3.5.0, 3.4.1 > > > Doing an outer-join using {{joinWith}} on {{{}DataFrame{}}}s used to return > missing values as {{null}} in Spark 2.4.8, but returns them as {{Rows}} with > {{null}} values in Spark 3+. > The issue can be reproduced with [the following > test|https://github.com/cdegroc/spark/commit/79f4d6a1ec6c69b10b72dbc8f92ab6490d5ef5e5] > that succeeds on Spark 2.4.8 but fails starting from Spark 3.0.0. > The problem only arises when working with DataFrames: Datasets of case > classes work as expected as demonstrated by [this other > test|https://github.com/apache/spark/blob/v3.0.0/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala#L1200-L1223]. > I couldn't find an explanation for this change in the Migration guide so I'm > assuming this is a bug. > A {{git bisect}} pointed me to [that > commit|https://github.com/apache/spark/commit/cd92f25be5a221e0d4618925f7bc9dfd3bb8cb59]. > Reverting the commit solves the problem. > A similar solution, but without reverting, is shown > [here|https://github.com/cdegroc/spark/commit/684c675bf070876a475a9b225f6c2f92edce4c8a]. > Happy to help if you think of another approach / can provide some guidance. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
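A small sketch of the behavior under discussion, assuming a local SparkSession; the expected output is the Spark 2.4.8 semantics that this fix restores:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("joinWith-nulls").getOrCreate()
import spark.implicits._

val left = Seq((1, "a"), (2, "b")).toDF("id", "v")
val right = Seq((1, "x")).toDF("id", "w")

// Full outer joinWith: id = 2 has no matching right row.
// Expected (2.4.8 semantics, restored by this fix): the right element is null.
// Buggy behavior (Spark 3.0.0 through 3.4.0): a Row with all-null fields.
left.joinWith(right, left("id") === right("id"), "full_outer")
  .collect()
  .foreach(println)
{code}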
[jira] [Updated] (SPARK-43184) Resume using enumeration to compare `NodeState.DECOMMISSIONING`
[ https://issues.apache.org/jira/browse/SPARK-43184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-43184: - Parent: SPARK-43197 Issue Type: Sub-task (was: Improvement) > Resume using enumeration to compare `NodeState.DECOMMISSIONING` > > > Key: SPARK-43184 > URL: https://issues.apache.org/jira/browse/SPARK-43184 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43185) Inline `hadoop-client` related properties in `pom.xml`
[ https://issues.apache.org/jira/browse/SPARK-43185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-43185: - Parent: SPARK-43197 Issue Type: Sub-task (was: Improvement) > Inline `hadoop-client` related properties in `pom.xml` > -- > > Key: SPARK-43185 > URL: https://issues.apache.org/jira/browse/SPARK-43185 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43186) Remove workaround for FileSinkDesc
[ https://issues.apache.org/jira/browse/SPARK-43186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-43186: - Parent: SPARK-43197 Issue Type: Sub-task (was: Improvement) > Remove workaround for FileSinkDesc > -- > > Key: SPARK-43186 > URL: https://issues.apache.org/jira/browse/SPARK-43186 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43187) Remove workaround for MiniKdc's BindException
[ https://issues.apache.org/jira/browse/SPARK-43187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-43187: - Parent: SPARK-43197 Issue Type: Sub-task (was: Test) > Remove workaround for MiniKdc's BindException > - > > Key: SPARK-43187 > URL: https://issues.apache.org/jira/browse/SPARK-43187 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43191) Replace reflection w/ direct calling for Hadoop CallerContext
[ https://issues.apache.org/jira/browse/SPARK-43191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-43191: - Parent: SPARK-43197 Issue Type: Sub-task (was: Improvement) > Replace reflection w/ direct calling for Hadoop CallerContext > -- > > Key: SPARK-43191 > URL: https://issues.apache.org/jira/browse/SPARK-43191 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43193) Remove workaround for HADOOP-12074
[ https://issues.apache.org/jira/browse/SPARK-43193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-43193: - Parent: SPARK-43197 Issue Type: Sub-task (was: Improvement) > Remove workaround for HADOOP-12074 > -- > > Key: SPARK-43193 > URL: https://issues.apache.org/jira/browse/SPARK-43193 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43195) Remove unnecessary serializable wrapper in HadoopFSUtils
[ https://issues.apache.org/jira/browse/SPARK-43195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-43195: - Parent: SPARK-43197 Issue Type: Sub-task (was: Improvement) > Remove unnecessary serializable wrapper in HadoopFSUtils > > > Key: SPARK-43195 > URL: https://issues.apache.org/jira/browse/SPARK-43195 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43196) Replace reflection w/ direct calling for `ContainerLaunchContext#setTokensConf`
[ https://issues.apache.org/jira/browse/SPARK-43196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-43196: - Parent: SPARK-43197 Issue Type: Sub-task (was: Improvement) > Replace reflection w/ direct calling for > `ContainerLaunchContext#setTokensConf` > --- > > Key: SPARK-43196 > URL: https://issues.apache.org/jira/browse/SPARK-43196 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24497) ANSI SQL: Recursive query
[ https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714114#comment-17714114 ] Peter Toth edited comment on SPARK-24497 at 4/19/23 2:00 PM: - I've opened a new PR: https://github.com/apache/spark/pull/40744 to support recursive SQL, but for some reason it didn't get automatically linked here. [~gurwls223], you might know what went wrong... was (Author: petertoth): I've opened a new PR: https://github.com/apache/spark/pull/40093 to support recursive SQL, but for some reason it didn't get automatically linked here. [~gurwls223], you might know what went wrong... > ANSI SQL: Recursive query > - > > Key: SPARK-24497 > URL: https://issues.apache.org/jira/browse/SPARK-24497 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > h3. *Examples* > Here is an example for {{WITH RECURSIVE}} clause usage. Table "department" > represents the structure of an organization as an adjacency list. > {code:sql} > CREATE TABLE department ( > id INTEGER PRIMARY KEY, -- department ID > parent_department INTEGER REFERENCES department, -- upper department ID > name TEXT -- department name > ); > INSERT INTO department (id, parent_department, "name") > VALUES > (0, NULL, 'ROOT'), > (1, 0, 'A'), > (2, 1, 'B'), > (3, 2, 'C'), > (4, 2, 'D'), > (5, 0, 'E'), > (6, 4, 'F'), > (7, 5, 'G'); > -- department structure represented here is as follows: > -- > -- ROOT-+->A-+->B-+->C > -- | | > -- | +->D-+->F > -- +->E-+->G > {code} > > To extract all departments under A, you can use the following recursive > query: > {code:sql} > WITH RECURSIVE subdepartment AS > ( > -- non-recursive term > SELECT * FROM department WHERE name = 'A' > UNION ALL > -- recursive term > SELECT d.* > FROM > department AS d > JOIN > subdepartment AS sd > ON (d.parent_department = sd.id) > ) > SELECT * > FROM subdepartment > ORDER BY name; > {code} > More details: > [http://wiki.postgresql.org/wiki/CTEReadme] > [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43197) Clean up the code written for compatibility with Hadoop 2
Yang Jie created SPARK-43197: Summary: Clean up the code written for compatibility with Hadoop 2 Key: SPARK-43197 URL: https://issues.apache.org/jira/browse/SPARK-43197 Project: Spark Issue Type: Umbrella Components: Spark Core, SQL, YARN Affects Versions: 3.5.0 Reporter: Yang Jie SPARK-42452 removed support for Hadoop 2, so we can clean up the code written for compatibility with Hadoop 2 and make it more concise. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24497) ANSI SQL: Recursive query
[ https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714114#comment-17714114 ] Peter Toth commented on SPARK-24497: I've opened a new PR: https://github.com/apache/spark/pull/40093 to support recursive SQL, but for some reason it didn't get automatically linked here. [~gurwls223], you might know what went wrong... > ANSI SQL: Recursive query > - > > Key: SPARK-24497 > URL: https://issues.apache.org/jira/browse/SPARK-24497 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > h3. *Examples* > Here is an example for {{WITH RECURSIVE}} clause usage. Table "department" > represents the structure of an organization as an adjacency list. > {code:sql} > CREATE TABLE department ( > id INTEGER PRIMARY KEY, -- department ID > parent_department INTEGER REFERENCES department, -- upper department ID > name TEXT -- department name > ); > INSERT INTO department (id, parent_department, "name") > VALUES > (0, NULL, 'ROOT'), > (1, 0, 'A'), > (2, 1, 'B'), > (3, 2, 'C'), > (4, 2, 'D'), > (5, 0, 'E'), > (6, 4, 'F'), > (7, 5, 'G'); > -- department structure represented here is as follows: > -- > -- ROOT-+->A-+->B-+->C > -- | | > -- | +->D-+->F > -- +->E-+->G > {code} > > To extract all departments under A, you can use the following recursive > query: > {code:sql} > WITH RECURSIVE subdepartment AS > ( > -- non-recursive term > SELECT * FROM department WHERE name = 'A' > UNION ALL > -- recursive term > SELECT d.* > FROM > department AS d > JOIN > subdepartment AS sd > ON (d.parent_department = sd.id) > ) > SELECT * > FROM subdepartment > ORDER BY name; > {code} > More details: > [http://wiki.postgresql.org/wiki/CTEReadme] > [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43196) Replace reflection w/ direct calling for `ContainerLaunchContext#setTokensConf`
Yang Jie created SPARK-43196: Summary: Replace reflection w/ direct calling for `ContainerLaunchContext#setTokensConf` Key: SPARK-43196 URL: https://issues.apache.org/jira/browse/SPARK-43196 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 3.5.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
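To illustrate the general before/after pattern behind this change (the trait below is a hypothetical stand-in for org.apache.hadoop.yarn.api.records.ContainerLaunchContext, modeling only the one relevant method; it is not the real Spark code):
{code:scala}
import java.nio.ByteBuffer

// Hypothetical stand-in for the YARN API surface.
trait LaunchContext { def setTokensConf(conf: ByteBuffer): Unit }

// Before: while Hadoop 2 (which lacks setTokensConf) was still supported,
// the method had to be looked up reflectively so the code compiled everywhere.
def setViaReflection(ctx: LaunchContext, buf: ByteBuffer): Unit = {
  val m = ctx.getClass.getMethod("setTokensConf", classOf[ByteBuffer])
  m.invoke(ctx, buf)
}

// After: with Hadoop 2 support removed, the compile-time dependency is
// always Hadoop 3, so a plain call is safe and simpler.
def setDirectly(ctx: LaunchContext, buf: ByteBuffer): Unit =
  ctx.setTokensConf(buf)
{code}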
[jira] [Created] (SPARK-43195) Remove unnecessary serializable wrapper in HadoopFSUtils
Cheng Pan created SPARK-43195: - Summary: Remove unnecessary serializable wrapper in HadoopFSUtils Key: SPARK-43195 URL: https://issues.apache.org/jira/browse/SPARK-43195 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.5.0 Reporter: Cheng Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43194) PySpark 3.4.0 cannot convert timestamp-typed objects to pandas with pandas 2.0
Phillip Cloud created SPARK-43194: - Summary: PySpark 3.4.0 cannot convert timestamp-typed objects to pandas with pandas 2.0 Key: SPARK-43194 URL: https://issues.apache.org/jira/browse/SPARK-43194 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.4.0 Environment: {code} In [4]: import pandas as pd In [5]: pd.__version__ Out[5]: '2.0.0' In [6]: import pyspark as ps In [7]: ps.__version__ Out[7]: '3.4.0' {code} Reporter: Phillip Cloud {code} In [1]: from pyspark.sql import SparkSession In [2]: session = SparkSession.builder.appName("test").getOrCreate() 23/04/19 09:21:42 WARN Utils: Your hostname, albatross resolves to a loopback address: 127.0.0.2; using 192.168.1.170 instead (on interface enp5s0) 23/04/19 09:21:42 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 23/04/19 09:21:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable In [3]: session.sql("select now()").toPandas() {code} Results in: {code} ... TypeError: Casting to unit-less dtype 'datetime64' is not supported. Pass e.g. 'datetime64[ns]' instead. {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43193) Remove workaround for HADOOP-12074
Cheng Pan created SPARK-43193: - Summary: Remove workaround for HADOOP-12074 Key: SPARK-43193 URL: https://issues.apache.org/jira/browse/SPARK-43193 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.5.0 Reporter: Cheng Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43191) Replace reflection w/ direct calling for Hadoop CallerContext
Cheng Pan created SPARK-43191: - Summary: Replace reflection w/ direct calling for Hadoop CallerContext Key: SPARK-43191 URL: https://issues.apache.org/jira/browse/SPARK-43191 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.5.0 Reporter: Cheng Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43192) Spark Connect's user agent validations are too restrictive
Niranjan Jayakar created SPARK-43192: Summary: Spark Connect's user agent validations are too restrictive Key: SPARK-43192 URL: https://issues.apache.org/jira/browse/SPARK-43192 Project: Spark Issue Type: Bug Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Niranjan Jayakar The current restrictions on the allowed charset and length are too restrictive: https://github.com/apache/spark/blob/cac6f58318bb84d532f02d245a50d3c66daa3e4b/python/pyspark/sql/connect/client.py#L274-L275 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43190) ListQuery.childOutput should be consistent with child output
Wenchen Fan created SPARK-43190: --- Summary: ListQuery.childOutput should be consistent with child output Key: SPARK-43190 URL: https://issues.apache.org/jira/browse/SPARK-43190 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43189) No overload variant of "pandas_udf" matches argument type "str"
[ https://issues.apache.org/jira/browse/SPARK-43189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Grigorev updated SPARK-43189: Description: h2. Issue Users who have mypy enabled in their IDE or CI environment face very verbose error messages when using the {{pandas_udf}} function in PySpark. The current typing of the {{pandas_udf}} function seems to be causing these issues. As a workaround, the official documentation provides examples that use {{{}# type: ignore[call-overload]{}}}, but this is not an ideal solution. h2. Example Here's a code snippet taken from [docs|https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-udfs-a-k-a-vectorized-udfs] that triggers the error when mypy is enabled: {code:python} from pyspark.sql.functions import pandas_udf import pandas as pd @pandas_udf("col1 string, col2 long") def func(s1: pd.Series, s2: pd.Series, s3: pd.DataFrame) -> pd.DataFrame: s3['col2'] = s1 + s2.str.len() return s3 {code} Running mypy on this code results in a long and verbose error message, which makes it difficult for users to understand the actual issue and how to resolve it. h2. Proposed Solution We kindly request the PySpark development team to review and improve the typing for the {{pandas_udf}} function to prevent these verbose error messages from appearing. This improvement will help users who have mypy enabled in their development environments to have a better experience when using PySpark. Furthermore, we suggest updating the official documentation to provide better examples that do not rely on {{# type: ignore[call-overload]}} to suppress these errors. h2. Impact By addressing this issue, users of PySpark with mypy enabled in their development environment will be able to write and verify their code more efficiently, without being overwhelmed by verbose error messages. This will lead to a more enjoyable and productive experience when working with PySpark and pandas UDFs. was: h2. Issue Users who have mypy enabled in their IDE or CI environment face very verbose error messages when using the {{pandas_udf}} function in PySpark. The current typing of the {{pandas_udf}} function seems to be causing these issues. As a workaround, the official documentation provides examples that use {{{}# type: ignore[call-overload]{}}}, but this is not an ideal solution. h2. Example Here's a code snippet that triggers the error when mypy is enabled: {code:python} from pyspark.sql.functions import pandas_udf import pandas as pd @pandas_udf("string") def f(s: pd.Series) -> pd.Series: return pd.Series(["a"]*len(s), index=s.index) {code} Running mypy on this code results in a long and verbose error message, which makes it difficult for users to understand the actual issue and how to resolve it. h2. Proposed Solution We kindly request the PySpark development team to review and improve the typing for the {{pandas_udf}} function to prevent these verbose error messages from appearing. This improvement will help users who have mypy enabled in their development environments to have a better experience when using PySpark. Furthermore, we suggest updating the official documentation to provide better examples that do not rely on {{# type: ignore[call-overload]}} to suppress these errors. h2. Impact By addressing this issue, users of PySpark with mypy enabled in their development environment will be able to write and verify their code more efficiently, without being overwhelmed by verbose error messages. 
This will lead to a more enjoyable and productive experience when working with PySpark and pandas UDFs. > No overload variant of "pandas_udf" matches argument type "str" > --- > > Key: SPARK-43189 > URL: https://issues.apache.org/jira/browse/SPARK-43189 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.2.4, 3.3.2, 3.4.0 >Reporter: Andrew Grigorev >Priority: Major > > h2. Issue > Users who have mypy enabled in their IDE or CI environment face very verbose > error messages when using the {{pandas_udf}} function in PySpark. The current > typing of the {{pandas_udf}} function seems to be causing these issues. As a > workaround, the official documentation provides examples that use {{{}# type: > ignore[call-overload]{}}}, but this is not an ideal solution. > h2. Example > Here's a code snippet taken from > [docs|https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-udfs-a-k-a-vectorized-udfs] > that triggers the error when mypy is enabled: > {code:python} > from pyspark.sql.functions import pandas_udf > import pandas as pd > @pandas_udf("col1 string, col2 long") > def func(s1: pd.Series, s2: pd.Series, s3: pd.DataFrame) -> pd.DataFrame:
[jira] [Created] (SPARK-43189) No overload variant of "pandas_udf" matches argument type "str"
Andrew Grigorev created SPARK-43189: --- Summary: No overload variant of "pandas_udf" matches argument type "str" Key: SPARK-43189 URL: https://issues.apache.org/jira/browse/SPARK-43189 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0, 3.3.2, 3.2.4 Reporter: Andrew Grigorev h2. Issue Users who have mypy enabled in their IDE or CI environment face very verbose error messages when using the {{pandas_udf}} function in PySpark. The current typing of the {{pandas_udf}} function seems to be causing these issues. As a workaround, the official documentation provides examples that use {{{}# type: ignore[call-overload]{}}}, but this is not an ideal solution. h2. Example Here's a code snippet that triggers the error when mypy is enabled: {code:python} from pyspark.sql.functions import pandas_udf import pandas as pd @pandas_udf("string") def f(s: pd.Series) -> pd.Series: return pd.Series(["a"]*len(s), index=s.index) {code} Running mypy on this code results in a long and verbose error message, which makes it difficult for users to understand the actual issue and how to resolve it. h2. Proposed Solution We kindly request the PySpark development team to review and improve the typing for the {{pandas_udf}} function to prevent these verbose error messages from appearing. This improvement will help users who have mypy enabled in their development environments to have a better experience when using PySpark. Furthermore, we suggest updating the official documentation to provide better examples that do not rely on {{# type: ignore[call-overload]}} to suppress these errors. h2. Impact By addressing this issue, users of PySpark with mypy enabled in their development environment will be able to write and verify their code more efficiently, without being overwhelmed by verbose error messages. This will lead to a more enjoyable and productive experience when working with PySpark and pandas UDFs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43188) Cannot write to Azure Datalake Gen2 (abfs/abfss) after Spark 3.1.2
[ https://issues.apache.org/jira/browse/SPARK-43188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolas PHUNG updated SPARK-43188: -- Description: Hello, I have an issue with Spark 3.3.2 & Spark 3.4.0 to write into Azure Data Lake Storage Gen2 (abfs/abfss scheme). I've got the following errors: {code:java} warn 13:12:47.554: StdErr from Kernel Process 23/04/19 13:12:47 ERROR FileFormatWriter: Aborting job 6a75949c-1473-4445-b8ab-d125be3f0f21.org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1) (myhost executor driver): org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for datablock-0001- at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:462) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:165) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146) at org.apache.hadoop.fs.store.DataBlocks$DiskBlockFactory.createTmpFileForWrite(DataBlocks.java:980) at org.apache.hadoop.fs.store.DataBlocks$DiskBlockFactory.create(DataBlocks.java:960) at org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.createBlockIfNeeded(AbfsOutputStream.java:262) at org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.(AbfsOutputStream.java:173) at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.createFile(AzureBlobFileSystemStore.java:580) at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.create(AzureBlobFileSystem.java:301) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1195) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1175) at org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:74) at org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:347) at org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:314) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:480) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:420) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:409) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:36) at org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$$anon$1.newInstance(ParquetUtils.scala:490) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:146) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:389) at org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:139) at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2720) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2720) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1206) at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:
[jira] [Created] (SPARK-43188) Cannot write to Azure Datalake Gen2 (abfs/abfss) after Spark 3.1.2
Nicolas PHUNG created SPARK-43188: - Summary: Cannot write to Azure Datalake Gen2 (abfs/abfss) after Spark 3.1.2 Key: SPARK-43188 URL: https://issues.apache.org/jira/browse/SPARK-43188 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 3.4.0, 3.3.2 Reporter: Nicolas PHUNG Hello, I have an issue with Spark 3.3.2 & Spark 3.4.0 to write into Azure Data Lake Storage Gen2 (abfs/abfss scheme). I've got the following errors: {code:java} warn 13:12:47.554: StdErr from Kernel Process 23/04/19 13:12:47 ERROR FileFormatWriter: Aborting job 6a75949c-1473-4445-b8ab-d125be3f0f21.org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1) (FR07258024L.dsk.eur.msd.world.socgen executor driver): org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for datablock-0001- at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:462) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:165) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146) at org.apache.hadoop.fs.store.DataBlocks$DiskBlockFactory.createTmpFileForWrite(DataBlocks.java:980) at org.apache.hadoop.fs.store.DataBlocks$DiskBlockFactory.create(DataBlocks.java:960) at org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.createBlockIfNeeded(AbfsOutputStream.java:262) at org.apache.hadoop.fs.azurebfs.services.AbfsOutputStream.(AbfsOutputStream.java:173) at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.createFile(AzureBlobFileSystemStore.java:580) at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.create(AzureBlobFileSystem.java:301) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1195) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1175) at org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:74) at org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:347) at org.apache.parquet.hadoop.ParquetFileWriter.(ParquetFileWriter.java:314) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:480) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:420) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:409) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetOutputWriter.scala:36) at org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$$anon$1.newInstance(ParquetUtils.scala:490) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:146) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:389) at org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) at 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:139) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2785) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2721) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2720) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.scheduler.DAGSche
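The stack trace points at hadoop-azure's disk-backed upload buffer (DataBlocks$DiskBlockFactory) failing to find a usable local directory. One mitigation sketch, assuming the ABFS block-buffer option shipped on the Hadoop 3.3 line; verify the key and its supported values against your Hadoop version before relying on it:
{code:scala}
import org.apache.spark.sql.SparkSession

// Assumption: fs.azure.data.blocks.buffer selects how ABFS buffers upload
// blocks ("disk" is the default; "bytebuffer" or "array" keep blocks in
// memory and avoid the failing LocalDirAllocator lookup seen above).
val spark = SparkSession.builder()
  .appName("abfss-write")
  .config("spark.hadoop.fs.azure.data.blocks.buffer", "bytebuffer")
  .getOrCreate()
{code}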
[jira] [Commented] (SPARK-43187) Remove workaround for MiniKdc's BindException
[ https://issues.apache.org/jira/browse/SPARK-43187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17714071#comment-17714071 ] Ignite TC Bot commented on SPARK-43187: --- User 'pan3793' has created a pull request for this issue: https://github.com/apache/spark/pull/40849 > Remove workaround for MiniKdc's BindException > - > > Key: SPARK-43187 > URL: https://issues.apache.org/jira/browse/SPARK-43187 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.5.0 >Reporter: Cheng Pan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43187) Remove workaround for MiniKdc's BindException
Cheng Pan created SPARK-43187: - Summary: Remove workaround for MiniKdc's BindException Key: SPARK-43187 URL: https://issues.apache.org/jira/browse/SPARK-43187 Project: Spark Issue Type: Test Components: Tests Affects Versions: 3.5.0 Reporter: Cheng Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43186) Remove workaround for FileSinkDesc
Cheng Pan created SPARK-43186: - Summary: Remove workaround for FileSinkDesc Key: SPARK-43186 URL: https://issues.apache.org/jira/browse/SPARK-43186 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Cheng Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43185) Inline `hadoop-client` related properties in `pom.xml`
Yang Jie created SPARK-43185: Summary: Inline `hadoop-client` related properties in `pom.xml` Key: SPARK-43185 URL: https://issues.apache.org/jira/browse/SPARK-43185 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43176) Deduplicate imports in Connect Tests
[ https://issues.apache.org/jira/browse/SPARK-43176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-43176. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40839 [https://github.com/apache/spark/pull/40839] > Deduplicate imports in Connect Tests > > > Key: SPARK-43176 > URL: https://issues.apache.org/jira/browse/SPARK-43176 > Project: Spark > Issue Type: Test > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43176) Deduplicate imports in Connect Tests
[ https://issues.apache.org/jira/browse/SPARK-43176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-43176: Assignee: Ruifeng Zheng > Deduplicate imports in Connect Tests > > > Key: SPARK-43176 > URL: https://issues.apache.org/jira/browse/SPARK-43176 > Project: Spark > Issue Type: Test > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43184) Resume using enumeration to compare `NodeState.DECOMMISSIONING`
Yang Jie created SPARK-43184: Summary: Resume using enumeration to compare `NodeState.DECOMMISSIONING` Key: SPARK-43184 URL: https://issues.apache.org/jira/browse/SPARK-43184 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 3.5.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43183) Move update event on idleness in streaming query listener to separate callback method
Jungtaek Lim created SPARK-43183: Summary: Move update event on idleness in streaming query listener to separate callback method Key: SPARK-43183 URL: https://issues.apache.org/jira/browse/SPARK-43183 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.5.0 Reporter: Jungtaek Lim People have been having a lot of confusion about the update event on idleness; it is not only a matter of understanding but also a source of various complaints. For example, since we attach the latest batch ID to the update event on idleness, a listener implementation that blindly performs an upsert keyed on batch ID risks losing metrics. It also complicates the logic, because we have to remember the execution of the previous batch, which is arguably unnecessary. Because of this, we would do better to move the idle event out of the progress update event and give it a dedicated callback method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
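A sketch of what the proposed separation could look like from a listener author's point of view; the onQueryIdle callback shown in the comment is illustrative of the proposal, not a released API:
{code:scala}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

class MetricsListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()

  // After the proposed change, this fires only for real batch progress, so an
  // upsert keyed on batchId can no longer be clobbered by idle-time updates.
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    println(s"batch ${p.batchId}: ${p.numInputRows} rows")
  }

  // Proposed dedicated callback for idleness (illustrative only):
  // def onQueryIdle(event: QueryIdleEvent): Unit = ()

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
}
{code}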
[jira] [Updated] (SPARK-43182) Multiple tables join with limit when AE is enabled and one table is skewed
[ https://issues.apache.org/jira/browse/SPARK-43182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu Shuo updated SPARK-43182: - Description: When we test AE in Spark3.4.0 with the following case, we find If we disable AE or enable Ae but disable skewJoin, the sql will finish in 20s, but if we enable AE and enable skewJoin,it will take very long time. The test case: {code:java} ###uncompress the part-m-***.zip attachment, and put these files under '/tmp/spark-warehouse/data/' dir. create table source_aqe(c1 int,c18 string) using csv options(path 'file:///tmp/spark-warehouse/data/'); create table hive_snappy_aqe_table1(c1 int)stored as PARQUET partitioned by(c18 string); insert into table hive_snappy_aqe_table1 partition(c18=1)select c1 from source_aqe; insert into table hive_snappy_aqe_table1 partition(c18=2)select c1 from source_aqe limit 12; insert into table hive_snappy_aqe_table1 partition(c18=3)select c1 from source_aqe limit 15;create table hive_snappy_aqe_table2(c1 int)stored as PARQUET partitioned by(c18 string); insert into table hive_snappy_aqe_table2 partition(c18=1)select c1 from source_aqe limit 16; insert into table hive_snappy_aqe_table2 partition(c18=2)select c1 from source_aqe limit 12;create table hive_snappy_aqe_table3(c1 int)stored as PARQUET partitioned by(c18 string); insert into table hive_snappy_aqe_table3 partition(c18=1)select c1 from source_aqe limit 16; insert into table hive_snappy_aqe_table3 partition(c18=2)select c1 from source_aqe limit 12; set spark.sql.adaptive.enabled=false; set spark.sql.adaptive.forceOptimizeSkewedJoin = false; set spark.sql.adaptive.skewJoin.skewedPartitionFactor=1; set spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=10KB; set spark.sql.adaptive.advisoryPartitionSizeInBytes=100KB; set spark.sql.autoBroadcastJoinThreshold = 51200; ###it will finish in 20s select * from hive_snappy_aqe_table1 join hive_snappy_aqe_table2 on hive_snappy_aqe_table1.c18=hive_snappy_aqe_table2.c18 join hive_snappy_aqe_table3 on hive_snappy_aqe_table1.c18=hive_snappy_aqe_table3.c18 limit 10; set spark.sql.adaptive.enabled=true; set spark.sql.adaptive.forceOptimizeSkewedJoin = true; set spark.sql.adaptive.skewJoin.skewedPartitionFactor=1; set spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=10KB; set spark.sql.adaptive.advisoryPartitionSizeInBytes=100KB; set spark.sql.autoBroadcastJoinThreshold = 51200; ###it will take very long time select * from hive_snappy_aqe_table1 join hive_snappy_aqe_table2 on hive_snappy_aqe_table1.c18=hive_snappy_aqe_table2.c18 join hive_snappy_aqe_table3 on hive_snappy_aqe_table1.c18=hive_snappy_aqe_table3.c18 limit 10; {code} was: When we test AE in Spark3.4.0 with the following case, we find If we disable AE or enable Ae but disable skewJoin, the sql will finish in 20s, but if we enable AE and enable skewJoin,it will take very long time. The test case: {code:java} ###uncompress the data.zip, and put files under '/tmp/spark-warehouse/data/' dir. 
create table source_aqe(c1 int,c18 string) using csv options(path 'file:///tmp/spark-warehouse/data/'); create table hive_snappy_aqe_table1(c1 int)stored as PARQUET partitioned by(c18 string); insert into table hive_snappy_aqe_table1 partition(c18=1)select c1 from source_aqe; insert into table hive_snappy_aqe_table1 partition(c18=2)select c1 from source_aqe limit 12; insert into table hive_snappy_aqe_table1 partition(c18=3)select c1 from source_aqe limit 15;create table hive_snappy_aqe_table2(c1 int)stored as PARQUET partitioned by(c18 string); insert into table hive_snappy_aqe_table2 partition(c18=1)select c1 from source_aqe limit 16; insert into table hive_snappy_aqe_table2 partition(c18=2)select c1 from source_aqe limit 12;create table hive_snappy_aqe_table3(c1 int)stored as PARQUET partitioned by(c18 string); insert into table hive_snappy_aqe_table3 partition(c18=1)select c1 from source_aqe limit 16; insert into table hive_snappy_aqe_table3 partition(c18=2)select c1 from source_aqe limit 12; set spark.sql.adaptive.enabled=false; set spark.sql.adaptive.forceOptimizeSkewedJoin = false; set spark.sql.adaptive.skewJoin.skewedPartitionFactor=1; set spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=10KB; set spark.sql.adaptive.advisoryPartitionSizeInBytes=100KB; set spark.sql.autoBroadcastJoinThreshold = 51200; ###it will finish in 20s select * from hive_snappy_aqe_table1 join hive_snappy_aqe_table2 on hive_snappy_aqe_table1.c18=hive_snappy_aqe_table2.c18 join hive_snappy_aqe_table3 on hive_snappy_aqe_table1.c18=hive_snappy_aqe_table3.c18 limit 10; set spark.sql.adaptive.enabled=true; set spark.sql.adaptive.forceOptimizeSkewedJoin = true; set spark.sql.adaptive.skewJoin.skewedPartitionFactor=1; set spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=10KB; set spark.sql.adaptive.advisoryPartitionSize
[jira] [Updated] (SPARK-43182) Multiple tables join with limit when AE is enabled and one table is skewed
[ https://issues.apache.org/jira/browse/SPARK-43182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu Shuo updated SPARK-43182: - Attachment: part-m-9.zip part-m-8.zip part-m-7.zip part-m-6.zip part-m-5.zip part-m-4.zip part-m-3.zip part-m-2.zip part-m-00016.zip part-m-00015.zip part-m-00014.zip part-m-00013.zip part-m-00012.zip part-m-00011.zip part-m-00010.zip part-m-1.zip part-m-0.zip part-m-00019.zip part-m-00018.zip part-m-00017.zip
> Multiple tables join with limit when AE is enabled and one table is skewed
> --
>
> Key: SPARK-43182
> URL: https://issues.apache.org/jira/browse/SPARK-43182
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Liu Shuo
> Priority: Critical
> Attachments: part-m-0.zip, part-m-1.zip, part-m-2.zip, part-m-3.zip, part-m-4.zip, part-m-5.zip, part-m-6.zip, part-m-7.zip, part-m-8.zip, part-m-9.zip, part-m-00010.zip, part-m-00011.zip, part-m-00012.zip, part-m-00013.zip, part-m-00014.zip, part-m-00015.zip, part-m-00016.zip, part-m-00017.zip, part-m-00018.zip, part-m-00019.zip
>
> When we test AE (adaptive query execution) in Spark 3.4.0 with the following case, we find that if we disable AE, or enable AE but disable skewJoin, the SQL finishes in 20s; but if we enable both AE and skewJoin, it takes a very long time.
> The test case:
> {code:sql}
> ### uncompress data.zip and put the files under the '/tmp/spark-warehouse/data/' dir.
> create table source_aqe(c1 int, c18 string) using csv options(path 'file:///tmp/spark-warehouse/data/');
> create table hive_snappy_aqe_table1(c1 int) stored as PARQUET partitioned by(c18 string);
> insert into table hive_snappy_aqe_table1 partition(c18=1) select c1 from source_aqe;
> insert into table hive_snappy_aqe_table1 partition(c18=2) select c1 from source_aqe limit 12;
> insert into table hive_snappy_aqe_table1 partition(c18=3) select c1 from source_aqe limit 15;
> create table hive_snappy_aqe_table2(c1 int) stored as PARQUET partitioned by(c18 string);
> insert into table hive_snappy_aqe_table2 partition(c18=1) select c1 from source_aqe limit 16;
> insert into table hive_snappy_aqe_table2 partition(c18=2) select c1 from source_aqe limit 12;
> create table hive_snappy_aqe_table3(c1 int) stored as PARQUET partitioned by(c18 string);
> insert into table hive_snappy_aqe_table3 partition(c18=1) select c1 from source_aqe limit 16;
> insert into table hive_snappy_aqe_table3 partition(c18=2) select c1 from source_aqe limit 12;
> set spark.sql.adaptive.enabled=false;
> set spark.sql.adaptive.forceOptimizeSkewedJoin=false;
> set spark.sql.adaptive.skewJoin.skewedPartitionFactor=1;
> set spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=10KB;
> set spark.sql.adaptive.advisoryPartitionSizeInBytes=100KB;
> set spark.sql.autoBroadcastJoinThreshold=51200;
> ### it finishes in 20s
> select * from hive_snappy_aqe_table1 join hive_snappy_aqe_table2 on hive_snappy_aqe_table1.c18=hive_snappy_aqe_table2.c18 join hive_snappy_aqe_table3 on hive_snappy_aqe_table1.c18=hive_snappy_aqe_table3.c18 limit 10;
> set spark.sql.adaptive.enabled=true;
> set spark.sql.adaptive.forceOptimizeSkewedJoin=true;
> set spark.sql.adaptive.skewJoin.skewedPartitionFactor=1;
> set spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=10KB;
> set spark.sql.adaptive.advisoryPartitionSizeInBytes=100KB;
> set spark.sql.autoBroadcastJoinThreshold=51200;
> ### it takes a very long time
> select * from hive_snappy_aqe_table1 join hive_snappy_aqe_table2 on hive_snappy_aqe_table1.c18=hive_snappy_aqe_table2.c18 join hive_snappy_aqe_table3 on hive_snappy_aqe_table1.c18=hive_snappy_aqe_table3.c18 limit 10;
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
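The repro pins the slowdown to the combination of AE plus skew-join handling; with AE on but skew-join handling off, the reporter's measurements show the query finishing quickly. As a hedged illustration only (a workaround sketch, not a fix recorded in this ticket), the same toggle can be applied programmatically; the sketch assumes an existing SparkSession and the tables created by the repro script above:
{code:scala}
// Workaround sketch for the behavior reported above: keep AQE enabled but
// disable skew-join splitting, the combination the reporter measured as fast.
// Assumes the hive_snappy_aqe_table* tables from the repro script exist.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SPARK-43182-workaround")
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

spark.conf.set("spark.sql.adaptive.enabled", "true")
// Disabling skew-join handling sidesteps the pathological path described above.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "false")
spark.conf.set("spark.sql.adaptive.forceOptimizeSkewedJoin", "false")

spark.sql(
  """SELECT * FROM hive_snappy_aqe_table1
    |JOIN hive_snappy_aqe_table2
    |  ON hive_snappy_aqe_table1.c18 = hive_snappy_aqe_table2.c18
    |JOIN hive_snappy_aqe_table3
    |  ON hive_snappy_aqe_table1.c18 = hive_snappy_aqe_table3.c18
    |LIMIT 10""".stripMargin).show()
{code}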
[jira] [Updated] (SPARK-43182) Multiple tables join with limit when AE is enabled and one table is skewed
[ https://issues.apache.org/jira/browse/SPARK-43182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu Shuo updated SPARK-43182: - Attachment: (was: part-m-0.zip)
> Multiple tables join with limit when AE is enabled and one table is skewed
> --
>
> Key: SPARK-43182
> URL: https://issues.apache.org/jira/browse/SPARK-43182
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Liu Shuo
> Priority: Critical
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43181) spark-sql console should display the Spark Web UI address
[ https://issues.apache.org/jira/browse/SPARK-43181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713991#comment-17713991 ] ASF GitHub Bot commented on SPARK-43181: User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/40844
> spark-sql console should display the Spark Web UI address
> --
>
> Key: SPARK-43181
> URL: https://issues.apache.org/jira/browse/SPARK-43181
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.5.0
> Reporter: BingKun Pan
> Priority: Minor
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
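Until a change like the linked PR lands, the address the console would print is already exposed on SparkContext; a minimal sketch, assuming a live session (uiWebUrl is the existing public accessor and returns None when the UI is disabled):
{code:scala}
// Print the Spark Web UI address for the current session.
// SparkContext.uiWebUrl is an Option[String]; it is None when
// spark.ui.enabled=false.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("show-web-ui-address")
  .master("local[*]")
  .getOrCreate()

spark.sparkContext.uiWebUrl match {
  case Some(url) => println(s"Spark Web UI available at $url")
  case None      => println("Spark Web UI is disabled")
}
{code}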
[jira] [Updated] (SPARK-43182) Multiple tables join with limit when AE is enabled and one table is skewed
[ https://issues.apache.org/jira/browse/SPARK-43182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu Shuo updated SPARK-43182: - Attachment: part-m-0.zip
> Multiple tables join with limit when AE is enabled and one table is skewed
> --
>
> Key: SPARK-43182
> URL: https://issues.apache.org/jira/browse/SPARK-43182
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Liu Shuo
> Priority: Critical
> Attachments: part-m-0.zip
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42869) cannot analyze window expression on subquery
[ https://issues.apache.org/jira/browse/SPARK-42869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713987#comment-17713987 ] GuangWeiHong commented on SPARK-42869: -- OK, thanks
> cannot analyze window expression on subquery
> --
>
> Key: SPARK-42869
> URL: https://issues.apache.org/jira/browse/SPARK-42869
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.1
> Reporter: GuangWeiHong
> Priority: Major
> Attachments: image-2023-03-20-18-00-40-578.png, image-2023-04-17-19-06-28-069.png, image-2023-04-17-19-09-41-485.png
>
> CREATE TABLE test_noindex_table(`name` STRING, `age` INT, `city` STRING) PARTITIONED BY (`date` STRING);
>
> SELECT *
> FROM (
>   SELECT *, COUNT(1) OVER itr AS grp_size
>   FROM test_noindex_table
>   WINDOW itr AS (PARTITION BY city)
> ) tbl
> WINDOW itr2 AS (PARTITION BY city)
>
> This fails with: Window specification itr is not defined in the WINDOW clause.
> !image-2023-03-20-18-00-40-578.png|width=560,height=361!
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
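One way to sidestep the resolution error, offered here only as a suggestion rather than anything recorded in the ticket, is to inline the window specification at the point of use instead of naming it in a subquery-level WINDOW clause:
{code:scala}
// Workaround sketch: inline the window spec so nothing has to be resolved
// against a WINDOW clause defined at a different query level.
// Assumes a SparkSession `spark` and the test_noindex_table from the report.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SPARK-42869-workaround")
  .master("local[*]")
  .getOrCreate()

spark.sql(
  """SELECT *
    |FROM (
    |  SELECT *, COUNT(1) OVER (PARTITION BY city) AS grp_size
    |  FROM test_noindex_table
    |) tbl""".stripMargin).show()
{code}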
[jira] [Updated] (SPARK-43182) Multiple tables join with limit when AE is enabled and one table is skewed
[ https://issues.apache.org/jira/browse/SPARK-43182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu Shuo updated SPARK-43182: - Summary: Multiple tables join with limit when AE is enabled and one table is skewed (was: Mutiple tables join with limit when AE is enabled and one table is skewed)
> Multiple tables join with limit when AE is enabled and one table is skewed
> --
>
> Key: SPARK-43182
> URL: https://issues.apache.org/jira/browse/SPARK-43182
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Liu Shuo
> Priority: Critical
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43182) Multiple tables join with limit when AE is enabled and one table is skewed
[ https://issues.apache.org/jira/browse/SPARK-43182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liu Shuo updated SPARK-43182: - Summary: Multiple tables join with limit when AE is enabled and one table is skewed (was: 3 tables join with limit when AE is enabled and one table is skewed)
> Multiple tables join with limit when AE is enabled and one table is skewed
> --
>
> Key: SPARK-43182
> URL: https://issues.apache.org/jira/browse/SPARK-43182
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Liu Shuo
> Priority: Critical
>
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org