[jira] [Commented] (SPARK-37068) Confusing tgz filename for download
[ https://issues.apache.org/jira/browse/SPARK-37068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433211#comment-17433211 ] Sean R. Owen commented on SPARK-37068: -- Yes, too late to change it, but the 'hadoop-3.2' in the file name means '... or later' really. The code is compiled vs Hadoop 3.3. We'll eventually fix the profile names and thus release tarball, but that is the right download. > Confusing tgz filename for download > --- > > Key: SPARK-37068 > URL: https://issues.apache.org/jira/browse/SPARK-37068 > Project: Spark > Issue Type: Bug > Components: Build, Documentation >Affects Versions: 3.2.0 >Reporter: James Yu >Priority: Minor > Attachments: spark-download-issue.png > > > In the Spark download webpage [https://spark.apache.org/downloads.html], the > package type dropdown says "Hadoop 3.3", but the Download Spark tgz filename > contains "hadoop3.2" in it. It is confusing; which version is correct? > > Download Apache Spark(TM) > # Choose a Spark release: 3.2.0 (Oct 13 2021) > # Choose a package type: Pre-built for Apache Hadoop 3.3 and later > # Download Spark: spark-3.2.0-bin-hadoop3.2.tgz > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37084) Set spark.sql.files.openCostInBytes to bytesConf
[ https://issues.apache.org/jira/browse/SPARK-37084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37084. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34353 [https://github.com/apache/spark/pull/34353] > Set spark.sql.files.openCostInBytes to bytesConf > > > Key: SPARK-37084 > URL: https://issues.apache.org/jira/browse/SPARK-37084 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang He >Assignee: Yang He >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
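For context on the change above: making spark.sql.files.openCostInBytes a bytesConf means the option accepts size strings with units (e.g. "4MB") rather than only a plain byte count. Below is a minimal pure-Python sketch of that style of parsing; the suffix table and the bytes-by-default rule are illustrative assumptions, not Spark's actual implementation (which lives in JavaUtils on the JVM side).

```python
import re

# Sketch of parsing byte-size strings such as "4194304", "4m", or "4MB".
# The accepted suffixes and defaults are assumptions for illustration only.
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}

def parse_byte_string(value: str) -> int:
    m = re.fullmatch(r"\s*(\d+)\s*([a-zA-Z]*)\s*", value)
    if not m:
        raise ValueError(f"invalid byte string: {value!r}")
    number, suffix = int(m.group(1)), m.group(2).lower()
    # Accept "mb" as well as "m"; a bare number means bytes.
    suffix = suffix.rstrip("b") or "b"
    if suffix not in _UNITS:
        raise ValueError(f"unknown size suffix in: {value!r}")
    return number * _UNITS[suffix]
```

With a change like this, a value such as "4MB" and the equivalent raw byte count would both be accepted for the configuration key.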
[jira] [Assigned] (SPARK-37084) Set spark.sql.files.openCostInBytes to bytesConf
[ https://issues.apache.org/jira/browse/SPARK-37084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-37084: Assignee: Yang He > Set spark.sql.files.openCostInBytes to bytesConf > > > Key: SPARK-37084 > URL: https://issues.apache.org/jira/browse/SPARK-37084 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang He >Assignee: Yang He >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37068) Confusing tgz filename for download
[ https://issues.apache.org/jira/browse/SPARK-37068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433205#comment-17433205 ] Hyukjin Kwon commented on SPARK-37068: -- The name of the tar file would have to be changed; see also SPARK-33880. They really mean Hadoop 3 support in general. Since it's already released, it cannot be fixed at this moment though. cc [~srowen] [~sunchao] FYI. > Confusing tgz filename for download > --- > > Key: SPARK-37068 > URL: https://issues.apache.org/jira/browse/SPARK-37068 > Project: Spark > Issue Type: Bug > Components: Build, Documentation >Affects Versions: 3.2.0 >Reporter: James Yu >Priority: Minor > Attachments: spark-download-issue.png > > > In the Spark download webpage [https://spark.apache.org/downloads.html], the > package type dropdown says "Hadoop 3.3", but the Download Spark tgz filename > contains "hadoop3.2" in it. It is confusing; which version is correct? > > Download Apache Spark(TM) > # Choose a Spark release: 3.2.0 (Oct 13 2021) > # Choose a package type: Pre-built for Apache Hadoop 3.3 and later > # Download Spark: spark-3.2.0-bin-hadoop3.2.tgz > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37096) Where clause and where operator will report error on varchar column type
[ https://issues.apache.org/jira/browse/SPARK-37096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433201#comment-17433201 ] Hyukjin Kwon commented on SPARK-37096: -- cc [~cloud_fan] FYI > Where clause and where operator will report error on varchar column type > > > Key: SPARK-37096 > URL: https://issues.apache.org/jira/browse/SPARK-37096 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.1, 3.1.2 > Environment: HDP3.1.4 >Reporter: Ye Li >Priority: Major > > create table test1(col1 int, col2 varchar(120)) stored as orc; > insert into test1 values(123, 'abc'); > insert into test1 values(1234, 'abcd'); > > sparkSession.sql('select * from test1') > is OK, but > sparkSession.sql('select * from test1 where col2 = "abc"') > or > sparkSession.sql('select * from test1').where('col2 = "abc"') > report an error: > java.lang.UnsupportedOperationException: DataType: varchar(120) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37100) Pandas groupby UDFs would benefit from automatically redistributing data on the groupby key in order to prevent network issues running udf
[ https://issues.apache.org/jira/browse/SPARK-37100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37100: - Fix Version/s: (was: 3.2.1) > Pandas groupby UDFs would benefit from automatically redistributing data on > the groupby key in order to prevent network issues running udf > -- > > Key: SPARK-37100 > URL: https://issues.apache.org/jira/browse/SPARK-37100 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.2 >Reporter: Richard Williamson >Priority: Minor > Original Estimate: 168h > Remaining Estimate: 168h > > When running high-cardinality pandas UDF groupby steps (100,000s+ of unique > groups), jobs will either fail or have a high number of task failures due to > network errors on larger clusters (100+ nodes). This was not the specific code > causing issues but should be close to representative: > from pyspark.sql.functions import pandas_udf, PandasUDFType > from pyspark.sql.functions import rand > from fancyimpute import IterativeSVD > import numpy as np > import pandas as pd > > df = spark.range(0, 10).withColumn('v', rand()) > @pandas_udf(df.schema, PandasUDFType.GROUPED_MAP) > def solver(pdf): > pd.DataFrame(data=IterativeSVD(verbose=False).fit_transform(pdf.to_numpy())) > return pdf > > df.groupby('id').apply(solver).count() > > df.repartition('id') – this is required to fix it; can we make this > happen automatically without any adverse impacts? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37096) Where clause and where operator will report error on varchar column type
[ https://issues.apache.org/jira/browse/SPARK-37096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37096: - Priority: Major (was: Critical) > Where clause and where operator will report error on varchar column type > > > Key: SPARK-37096 > URL: https://issues.apache.org/jira/browse/SPARK-37096 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.1, 3.1.2 > Environment: HDP3.1.4 >Reporter: Ye Li >Priority: Major > > create table test1(col1 int, col2 varchar(120)) stored as orc; > insert into test1 values(123, 'abc'); > insert into test1 values(1234, 'abcd'); > > sparkSession.sql('select * from test1') > is OK, but > sparkSession.sql('select * from test1 where col2 = "abc"') > or > sparkSession.sql('select * from test1').where('col2 = "abc"') > report an error: > java.lang.UnsupportedOperationException: DataType: varchar(120) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37100) Pandas groupby UDFs would benefit from automatically redistributing data on the groupby key in order to prevent network issues running udf
Richard Williamson created SPARK-37100: -- Summary: Pandas groupby UDFs would benefit from automatically redistributing data on the groupby key in order to prevent network issues running udf Key: SPARK-37100 URL: https://issues.apache.org/jira/browse/SPARK-37100 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.1.2 Reporter: Richard Williamson Fix For: 3.2.1 When running high-cardinality pandas UDF groupby steps (100,000s+ of unique groups), jobs will either fail or have a high number of task failures due to network errors on larger clusters (100+ nodes). This was not the specific code causing issues but should be close to representative: from pyspark.sql.functions import pandas_udf, PandasUDFType from pyspark.sql.functions import rand from fancyimpute import IterativeSVD import numpy as np import pandas as pd df = spark.range(0, 10).withColumn('v', rand()) @pandas_udf(df.schema, PandasUDFType.GROUPED_MAP) def solver(pdf): pd.DataFrame(data=IterativeSVD(verbose=False).fit_transform(pdf.to_numpy())) return pdf df.groupby('id').apply(solver).count() df.repartition('id') – this is required to fix it; can we make this happen automatically without any adverse impacts? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
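The workaround mentioned in the report, df.repartition('id'), helps because hash partitioning by the groupby key places every row of a given group on a single partition before the grouped-map UDF runs, so the UDF step no longer has to gather a group's rows over the network. A conceptual pure-Python sketch of that idea follows; it is illustrative only, not Spark's actual Murmur3-based partitioner, and the function name is made up.

```python
from collections import defaultdict

# Conceptual sketch of hash partitioning by key (what repartition('id') does):
# rows sharing a key always hash to the same partition, so each group is
# fully co-located before a grouped-map UDF processes it.
def hash_partition(rows, key, num_partitions):
    partitions = defaultdict(list)
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions

rows = [{"id": i % 3, "v": i} for i in range(9)]
parts = hash_partition(rows, "id", 4)
# Every distinct id now lives in exactly one partition.
```

Whether Spark could insert this redistribution automatically without adverse impacts (e.g. an unnecessary shuffle when data is already partitioned suitably) is exactly the open question the ticket raises.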
[jira] [Commented] (SPARK-36554) Error message while trying to use spark sql functions directly on dataframe columns without using select expression
[ https://issues.apache.org/jira/browse/SPARK-36554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433137#comment-17433137 ] Nicolas Azrak commented on SPARK-36554: --- [~lekshmiii] I've added a test to validate this is working. If you are using spark in a project and need this fix you would have to compile it using the patch I've submitted in the PR. > Error message while trying to use spark sql functions directly on dataframe > columns without using select expression > --- > > Key: SPARK-36554 > URL: https://issues.apache.org/jira/browse/SPARK-36554 > Project: Spark > Issue Type: Bug > Components: Documentation, Examples, PySpark >Affects Versions: 3.1.1 >Reporter: Lekshmi Ramachandran >Priority: Minor > Labels: documentation, features, functions, spark-sql > Attachments: Screen Shot .png > > Original Estimate: 24h > Remaining Estimate: 24h > > The below code generates a dataframe successfully . Here make_date function > is used inside a select expression > > from pyspark.sql.functions import expr, make_date > df = spark.createDataFrame([(2020, 6, 26), (1000, 2, 29), (-44, 1, 1)],['Y', > 'M', 'D']) > df.select("*",expr("make_date(Y,M,D) as lk")).show() > > The below code fails with a message "cannot import name 'make_date' from > 'pyspark.sql.functions'" . Here the make_date function is directly called on > dataframe columns without select expression > > from pyspark.sql.functions import make_date > df = spark.createDataFrame([(2020, 6, 26), (1000, 2, 29), (-44, 1, 1)],['Y', > 'M', 'D']) > df.select(make_date(df.Y,df.M,df.D).alias("datefield")).show() > > The error message generated is misleading when it says "cannot import > make_date from pyspark.sql.functions" > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
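One detail worth noting about the misleading message in this ticket: "cannot import name 'make_date' from 'pyspark.sql.functions'" is Python's generic ImportError wording for any name a module does not export; it is not a Spark-specific diagnostic, which is why it reads as if the import itself were at fault. A stdlib-only sketch reproducing the shape of the message; the 'string' module and the missing name are stand-ins for pyspark.sql.functions and make_date.

```python
# Importing a name a module does not define raises ImportError with the
# generic "cannot import name ..." wording, regardless of the library.
# 'string' is a stand-in here for pyspark.sql.functions.
try:
    from string import make_date  # the stdlib 'string' module has no such name
except ImportError as exc:
    message = str(exc)
```

The message variable ends up holding text of the form "cannot import name 'make_date' from 'string' ...", mirroring the error the reporter saw.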
[jira] [Comment Edited] (SPARK-36554) Error message while trying to use spark sql functions directly on dataframe columns without using select expression
[ https://issues.apache.org/jira/browse/SPARK-36554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432714#comment-17432714 ] Lekshmi Ramachandran edited comment on SPARK-36554 at 10/22/21, 5:27 PM: - @Nicolas Azrak So how do I test if it is working ? was (Author: lekshmiii): @Nicolas Azrak So how do it test if it is working ? > Error message while trying to use spark sql functions directly on dataframe > columns without using select expression > --- > > Key: SPARK-36554 > URL: https://issues.apache.org/jira/browse/SPARK-36554 > Project: Spark > Issue Type: Bug > Components: Documentation, Examples, PySpark >Affects Versions: 3.1.1 >Reporter: Lekshmi Ramachandran >Priority: Minor > Labels: documentation, features, functions, spark-sql > Attachments: Screen Shot .png > > Original Estimate: 24h > Remaining Estimate: 24h > > The below code generates a dataframe successfully . Here make_date function > is used inside a select expression > > from pyspark.sql.functions import expr, make_date > df = spark.createDataFrame([(2020, 6, 26), (1000, 2, 29), (-44, 1, 1)],['Y', > 'M', 'D']) > df.select("*",expr("make_date(Y,M,D) as lk")).show() > > The below code fails with a message "cannot import name 'make_date' from > 'pyspark.sql.functions'" . Here the make_date function is directly called on > dataframe columns without select expression > > from pyspark.sql.functions import make_date > df = spark.createDataFrame([(2020, 6, 26), (1000, 2, 29), (-44, 1, 1)],['Y', > 'M', 'D']) > df.select(make_date(df.Y,df.M,df.D).alias("datefield")).show() > > The error message generated is misleading when it says "cannot import > make_date from pyspark.sql.functions" > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37091) Support Java 17 in SparkR SystemRequirements
[ https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-37091: - Priority: Trivial (was: Major) > Support Java 17 in SparkR SystemRequirements > > > Key: SPARK-37091 > URL: https://issues.apache.org/jira/browse/SPARK-37091 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 3.3.0 >Reporter: Darek >Priority: Trivial > Labels: newbie > Original Estimate: 1h > Remaining Estimate: 1h > > Please bump Java version to <= 17 in > [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION] > Currently it is set to be: > {code:java} > SystemRequirements: Java (>= 8, < 12){code} > [PR|https://github.com/apache/spark/pull/34371] has been created for this > issue already. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37091) Support Java 17 in SparkR SystemRequirements
[ https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433078#comment-17433078 ] Dongjoon Hyun edited comment on SPARK-37091 at 10/22/21, 5:13 PM: -- BTW, [~Bidek]. Please don't set `Target Version` next time. Apache Spark community has a policy for that. - [https://spark.apache.org/contributing.html] {code} Do not set the following fields: - Fix Version. This is assigned by committers only when resolved. - Target Version. This is assigned by committers to indicate a PR has been accepted for possible fix by the target version. {code} was (Author: dongjoon): BTW, [~Bidek]. Please don't set `Target Version`. - [https://spark.apache.org/contributing.html] {code} Do not set the following fields: - Fix Version. This is assigned by committers only when resolved. - Target Version. This is assigned by committers to indicate a PR has been accepted for possible fix by the target version. {code} > Support Java 17 in SparkR SystemRequirements > > > Key: SPARK-37091 > URL: https://issues.apache.org/jira/browse/SPARK-37091 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 3.3.0 >Reporter: Darek >Priority: Major > Labels: newbie > Fix For: 3.2.1 > > Original Estimate: 1h > Remaining Estimate: 1h > > Please bump Java version to <= 17 in > [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION] > Currently it is set to be: > {code:java} > SystemRequirements: Java (>= 8, < 12){code} > [PR|https://github.com/apache/spark/pull/34371] has been created for this > issue already. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37091) Support Java 17 in SparkR SystemRequirements
[ https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37091: -- Fix Version/s: (was: 3.2.1) > Support Java 17 in SparkR SystemRequirements > > > Key: SPARK-37091 > URL: https://issues.apache.org/jira/browse/SPARK-37091 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 3.3.0 >Reporter: Darek >Priority: Major > Labels: newbie > Original Estimate: 1h > Remaining Estimate: 1h > > Please bump Java version to <= 17 in > [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION] > Currently it is set to be: > {code:java} > SystemRequirements: Java (>= 8, < 12){code} > [PR|https://github.com/apache/spark/pull/34371] has been created for this > issue already. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37091) Support Java 17 in SparkR SystemRequirements
[ https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433078#comment-17433078 ] Dongjoon Hyun edited comment on SPARK-37091 at 10/22/21, 5:13 PM: -- BTW, [~Bidek]. Please don't set `Fix Version` and `Target Version` next time. Apache Spark community has a policy for that. The fields have different meaning in the community. - [https://spark.apache.org/contributing.html] {code} Do not set the following fields: - Fix Version. This is assigned by committers only when resolved. - Target Version. This is assigned by committers to indicate a PR has been accepted for possible fix by the target version. {code} was (Author: dongjoon): BTW, [~Bidek]. Please don't set `Target Version` next time. Apache Spark community has a policy for that. - [https://spark.apache.org/contributing.html] {code} Do not set the following fields: - Fix Version. This is assigned by committers only when resolved. - Target Version. This is assigned by committers to indicate a PR has been accepted for possible fix by the target version. {code} > Support Java 17 in SparkR SystemRequirements > > > Key: SPARK-37091 > URL: https://issues.apache.org/jira/browse/SPARK-37091 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 3.3.0 >Reporter: Darek >Priority: Major > Labels: newbie > Original Estimate: 1h > Remaining Estimate: 1h > > Please bump Java version to <= 17 in > [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION] > Currently it is set to be: > {code:java} > SystemRequirements: Java (>= 8, < 12){code} > [PR|https://github.com/apache/spark/pull/34371] has been created for this > issue already. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37091) Support Java 17 in SparkR SystemRequirements
[ https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433078#comment-17433078 ] Dongjoon Hyun commented on SPARK-37091: --- BTW, [~Bidek]. Please don't set `Target Version`. - [https://spark.apache.org/contributing.html] {code} Do not set the following fields: - Fix Version. This is assigned by committers only when resolved. - Target Version. This is assigned by committers to indicate a PR has been accepted for possible fix by the target version. {code} > Support Java 17 in SparkR SystemRequirements > > > Key: SPARK-37091 > URL: https://issues.apache.org/jira/browse/SPARK-37091 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 3.3.0 >Reporter: Darek >Priority: Major > Labels: newbie > Fix For: 3.2.1 > > Original Estimate: 1h > Remaining Estimate: 1h > > Please bump Java version to <= 17 in > [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION] > Currently it is set to be: > {code:java} > SystemRequirements: Java (>= 8, < 12){code} > [PR|https://github.com/apache/spark/pull/34371] has been created for this > issue already. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37091) Support Java 17 in SparkR SystemRequirements
[ https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37091: -- Target Version/s: (was: 3.3.0) > Support Java 17 in SparkR SystemRequirements > > > Key: SPARK-37091 > URL: https://issues.apache.org/jira/browse/SPARK-37091 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 3.3.0 >Reporter: Darek >Priority: Major > Labels: newbie > Fix For: 3.2.1 > > Original Estimate: 1h > Remaining Estimate: 1h > > Please bump Java version to <= 17 in > [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION] > Currently it is set to be: > {code:java} > SystemRequirements: Java (>= 8, < 12){code} > [PR|https://github.com/apache/spark/pull/34371] has been created for this > issue already. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37091) Support Java 17 in SparkR SystemRequirements
[ https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37091: -- Summary: Support Java 17 in SparkR SystemRequirements (was: Bump SystemRequirements to use Java 17) > Support Java 17 in SparkR SystemRequirements > > > Key: SPARK-37091 > URL: https://issues.apache.org/jira/browse/SPARK-37091 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 3.3.0 >Reporter: Darek >Priority: Major > Labels: newbie > Fix For: 3.2.1 > > Original Estimate: 1h > Remaining Estimate: 1h > > Please bump Java version to <= 17 in > [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION] > Currently it is set to be: > {code:java} > SystemRequirements: Java (>= 8, < 12){code} > [PR|https://github.com/apache/spark/pull/34371] has been created for this > issue already. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37091) Bump SystemRequirements to use Java 17
[ https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Darek updated SPARK-37091: -- Description: Please bump Java version to <= 17 in [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION] Currently it is set to be: {code:java} SystemRequirements: Java (>= 8, < 12){code} [PR|https://github.com/apache/spark/pull/34371] has been created for this issue already. was: Please bump Java version to <= 17 in [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION] Currently it is set to be: {code:java} SystemRequirements: Java (>= 8, < 12){code} [PR|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION#L16] has been created for this issue already. > Bump SystemRequirements to use Java 17 > -- > > Key: SPARK-37091 > URL: https://issues.apache.org/jira/browse/SPARK-37091 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 3.3.0 >Reporter: Darek >Priority: Major > Labels: newbie > Fix For: 3.2.1 > > Original Estimate: 1h > Remaining Estimate: 1h > > Please bump Java version to <= 17 in > [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION] > Currently it is set to be: > {code:java} > SystemRequirements: Java (>= 8, < 12){code} > [PR|https://github.com/apache/spark/pull/34371] has been created for this > issue already. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37091) Bump SystemRequirements to use Java 17
[ https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Darek updated SPARK-37091: -- Description: Please bump Java version to <= 17 in [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION] Currently it is set to be: {code:java} SystemRequirements: Java (>= 8, < 12){code} [PR|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION#L16] has been created for this issue already. was: Please bump Java version to <= 17 in [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION] Currently it is set to be: {code:java} SystemRequirements: Java (>= 8, < 12){code} > Bump SystemRequirements to use Java 17 > -- > > Key: SPARK-37091 > URL: https://issues.apache.org/jira/browse/SPARK-37091 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 3.3.0 >Reporter: Darek >Priority: Major > Labels: newbie > Fix For: 3.2.1 > > Original Estimate: 1h > Remaining Estimate: 1h > > Please bump Java version to <= 17 in > [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION] > Currently it is set to be: > {code:java} > SystemRequirements: Java (>= 8, < 12){code} > > [PR|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION#L16] > has been created for this issue already. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37091) Bump SystemRequirements to use Java 17
[ https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Darek updated SPARK-37091: -- Target Version/s: 3.3.0 (was: 3.2.0) Affects Version/s: (was: 3.2.0) 3.3.0 Description: Please bump Java version to <= 17 in [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION] Currently it is set to be: {code:java} SystemRequirements: Java (>= 8, < 12){code} was: Please bump Java version to > 11 in [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION] Currently it is set to be: {code:java} SystemRequirements: Java (>= 8, < 12){code} Summary: Bump SystemRequirements to use Java 17 (was: Bump SystemRequirements to use Java > 11) > Bump SystemRequirements to use Java 17 > -- > > Key: SPARK-37091 > URL: https://issues.apache.org/jira/browse/SPARK-37091 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 3.3.0 >Reporter: Darek >Priority: Major > Labels: newbie > Fix For: 3.2.1 > > Original Estimate: 1h > Remaining Estimate: 1h > > Please bump Java version to <= 17 in > [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION] > Currently it is set to be: > {code:java} > SystemRequirements: Java (>= 8, < 12){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37091) Bump SystemRequirements to use Java > 11
[ https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Darek updated SPARK-37091: -- Parent: SPARK-33772 Issue Type: Sub-task (was: Improvement) > Bump SystemRequirements to use Java > 11 > > > Key: SPARK-37091 > URL: https://issues.apache.org/jira/browse/SPARK-37091 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 3.2.0 >Reporter: Darek >Priority: Major > Labels: newbie > Fix For: 3.2.1 > > Original Estimate: 1h > Remaining Estimate: 1h > > Please bump Java version to > 11 in > [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION] > Currently it is set to be: > {code:java} > SystemRequirements: Java (>= 8, < 12){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35703) Relax constraint for Spark bucket join and remove HashClusteredDistribution
[ https://issues.apache.org/jira/browse/SPARK-35703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-35703: - Summary: Relax constraint for Spark bucket join and remove HashClusteredDistribution (was: Remove HashClusteredDistribution) > Relax constraint for Spark bucket join and remove HashClusteredDistribution > --- > > Key: SPARK-35703 > URL: https://issues.apache.org/jira/browse/SPARK-35703 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > Currently Spark has {{HashClusteredDistribution}} and > {{ClusteredDistribution}}. The only difference between the two is that the > former is more strict when deciding whether bucket join is allowed to avoid > shuffle: comparing to the latter, it requires *exact* match between the > clustering keys from the output partitioning (i.e., {{HashPartitioning}}) and > the join keys. However, this is unnecessary, as we should be able to avoid > shuffle when the set of clustering keys is a subset of join keys, just like > {{ClusteredDistribution}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
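The relaxation described in this ticket boils down to a set comparison: the strict HashClusteredDistribution rule requires the clustering keys from the output partitioning to exactly match the join keys, while the relaxed rule only requires them to be a subset of the join keys. A conceptual sketch of the two rules; the function names are illustrative, not Spark's internal API.

```python
# Old HashClusteredDistribution behaviour: shuffle avoided only on exact match.
def exact_match_ok(clustering_keys, join_keys):
    return list(clustering_keys) == list(join_keys)

# Relaxed behaviour: shuffle avoided when clustering keys are a subset of
# join keys -- rows with equal join keys necessarily agree on the subset,
# so they already hash to the same bucket.
def subset_ok(clustering_keys, join_keys):
    return set(clustering_keys) <= set(join_keys)

# A table bucketed on ['a'] joined on ['a', 'b']: the strict rule forces a
# shuffle, the relaxed rule does not.
```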
[jira] [Assigned] (SPARK-37091) Bump SystemRequirements to use Java > 11
[ https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37091: Assignee: Apache Spark > Bump SystemRequirements to use Java > 11 > > > Key: SPARK-37091 > URL: https://issues.apache.org/jira/browse/SPARK-37091 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.2.0 >Reporter: Darek >Assignee: Apache Spark >Priority: Major > Labels: newbie > Fix For: 3.2.1 > > Original Estimate: 1h > Remaining Estimate: 1h > > Please bump Java version to > 11 in > [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION] > Currently it is set to be: > {code:java} > SystemRequirements: Java (>= 8, < 12){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37091) Bump SystemRequirements to use Java > 11
[ https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433051#comment-17433051 ] Apache Spark commented on SPARK-37091: -- User 'Bidek56' has created a pull request for this issue: https://github.com/apache/spark/pull/34371 > Bump SystemRequirements to use Java > 11 > > > Key: SPARK-37091 > URL: https://issues.apache.org/jira/browse/SPARK-37091 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.2.0 >Reporter: Darek >Priority: Major > Labels: newbie > Fix For: 3.2.1 > > Original Estimate: 1h > Remaining Estimate: 1h > > Please bump Java version to > 11 in > [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION] > Currently it is set to be: > {code:java} > SystemRequirements: Java (>= 8, < 12){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37091) Bump SystemRequirements to use Java > 11
[ https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433053#comment-17433053 ] Apache Spark commented on SPARK-37091: -- User 'Bidek56' has created a pull request for this issue: https://github.com/apache/spark/pull/34371 > Bump SystemRequirements to use Java > 11 > > > Key: SPARK-37091 > URL: https://issues.apache.org/jira/browse/SPARK-37091 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.2.0 >Reporter: Darek >Priority: Major > Labels: newbie > Fix For: 3.2.1 > > Original Estimate: 1h > Remaining Estimate: 1h > > Please bump Java version to > 11 in > [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION] > Currently it is set to be: > {code:java} > SystemRequirements: Java (>= 8, < 12){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37091) Bump SystemRequirements to use Java > 11
[ https://issues.apache.org/jira/browse/SPARK-37091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37091: Assignee: (was: Apache Spark) > Bump SystemRequirements to use Java > 11 > > > Key: SPARK-37091 > URL: https://issues.apache.org/jira/browse/SPARK-37091 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.2.0 >Reporter: Darek >Priority: Major > Labels: newbie > Fix For: 3.2.1 > > Original Estimate: 1h > Remaining Estimate: 1h > > Please bump Java version to > 11 in > [DESCRIPTION|https://github.com/apache/spark/blob/f9f95686cb397271f55aaff29ec4352b4ef9aade/R/pkg/DESCRIPTION] > Currently it is set to be: > {code:java} > SystemRequirements: Java (>= 8, < 12){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37047) Add overloads for lpad and rpad for BINARY strings
[ https://issues.apache.org/jira/browse/SPARK-37047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433048#comment-17433048 ] Apache Spark commented on SPARK-37047: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/34370 > Add overloads for lpad and rpad for BINARY strings > -- > > Key: SPARK-37047 > URL: https://issues.apache.org/jira/browse/SPARK-37047 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Menelaos Karavelas >Assignee: Menelaos Karavelas >Priority: Major > Fix For: 3.3.0 > > > Currently, `lpad` and `rpad` accept BINARY strings as input (both in terms of > input string to be padded and padding pattern), and these strings get cast to > UTF8 strings. The result of the operation is a UTF8 string which may be > invalid as it can contain non-UTF8 characters. > What we would like to do is to overload `lpad` and `rpad` to accept BINARY > strings as inputs (both for the string to be padded and the padding pattern) > and produce a left or right padded BINARY string as output. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
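The intended BINARY semantics can be illustrated outside of Spark with Python's `bytes` type, which never round-trips through UTF-8, so arbitrary byte values (including ones that are invalid UTF-8, like `0xFF`) survive padding intact. This is a sketch of plausible semantics, assuming the usual SQL `lpad`/`rpad` convention of truncating when the input already meets or exceeds the target length:

```python
def lpad_binary(s: bytes, length: int, pad: bytes) -> bytes:
    # Pad on the left with a repeating byte pattern; truncate if too long.
    if len(s) >= length:
        return s[:length]
    fill = (pad * ((length - len(s)) // len(pad) + 1))[: length - len(s)]
    return fill + s

def rpad_binary(s: bytes, length: int, pad: bytes) -> bytes:
    # Same, padding on the right.
    if len(s) >= length:
        return s[:length]
    fill = (pad * ((length - len(s)) // len(pad) + 1))[: length - len(s)]
    return s + fill

# 0xFF is not valid UTF-8, so a cast-to-string implementation would mangle it;
# the byte-level version preserves it.
print(lpad_binary(b"\xff\x00", 5, b"\xab"))  # b'\xab\xab\xab\xff\x00'
```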
[jira] [Commented] (SPARK-37089) ParquetFileFormat registers task completion listeners lazily, causing Python writer thread to segfault when off-heap vectorized reader is enabled
[ https://issues.apache.org/jira/browse/SPARK-37089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433020#comment-17433020 ] Apache Spark commented on SPARK-37089: -- User 'ankurdave' has created a pull request for this issue: https://github.com/apache/spark/pull/34369 > ParquetFileFormat registers task completion listeners lazily, causing Python > writer thread to segfault when off-heap vectorized reader is enabled > - > > Key: SPARK-37089 > URL: https://issues.apache.org/jira/browse/SPARK-37089 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0 >Reporter: Ankur Dave >Assignee: Ankur Dave >Priority: Major > > The task completion listener that closes the vectorized reader is registered > lazily in ParquetFileFormat#buildReaderWithPartitionValues(). Since task > completion listeners are executed in reverse order of registration, it always > runs before the Python writer thread can be interrupted. > This contradicts the assumption in > https://issues.apache.org/jira/browse/SPARK-37088 / > https://github.com/apache/spark/pull/34245 that task completion listeners are > registered bottom-up, preventing that fix from working properly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
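The ordering problem described in the ticket is easy to reproduce in miniature. The sketch below models task completion listeners running in reverse order of registration (as Spark's `TaskContext` does): because the reader-closing listener is registered lazily, and therefore last, it fires first, releasing the off-heap buffers while the Python writer thread may still be reading them.

```python
# Listeners run in reverse order of registration.
listeners = []

def add_task_completion_listener(f):
    listeners.append(f)

order = []
# Registered early, e.g. by the Python runner.
add_task_completion_listener(lambda: order.append("interrupt Python writer"))
# Registered lazily by the reader, so it ends up LAST in the list...
add_task_completion_listener(lambda: order.append("close vectorized reader"))

for f in reversed(listeners):  # ...and therefore runs FIRST on completion.
    f()

print(order)  # ['close vectorized reader', 'interrupt Python writer']
```

The reader is closed before the writer thread is interrupted, which is the inversion of the bottom-up assumption made by the SPARK-37088 fix.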
[jira] [Resolved] (SPARK-37067) DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon
[ https://issues.apache.org/jira/browse/SPARK-37067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37067. - Fix Version/s: 3.3.0 3.2.1 Assignee: Linhong Liu Resolution: Fixed > DateTimeUtils.stringToTimestamp() incorrectly rejects timezone without colon > > > Key: SPARK-37067 > URL: https://issues.apache.org/jira/browse/SPARK-37067 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.1.0, 3.2.0 >Reporter: Linhong Liu >Assignee: Linhong Liu >Priority: Major > Fix For: 3.2.1, 3.3.0 > > > For the zoneid with format like "+" or "+0730", it can be parsed by > `ZoneId.of()` but will rejected by Spark's > `DateTimeUtils.stringToTimestamp()`. it means we will return null for some > valid datetime string, such as: `2021-10-11T03:58:03.000+0700` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
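The offset formats at issue can be checked by analogy with Python's `datetime`, whose `%z` directive accepts an offset both with and without a colon, much like `java.lang.ZoneId.of()` per the ticket. The bug was that Spark's `DateTimeUtils.stringToTimestamp()` accepted only the colon form, returning null for otherwise valid strings:

```python
from datetime import datetime, timedelta

# The exact string from the ticket, with a "+0700" (no-colon) offset.
ts = datetime.strptime("2021-10-11T03:58:03.000+0700",
                       "%Y-%m-%dT%H:%M:%S.%f%z")
print(ts.utcoffset())  # 7:00:00

# The colon form parses to the identical instant.
ts_colon = datetime.strptime("2021-10-11T03:58:03.000+07:00",
                             "%Y-%m-%dT%H:%M:%S.%f%z")
print(ts == ts_colon)  # True
```

Both spellings denote the same offset, so a parser that rejects one of them silently drops valid data.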
[jira] [Commented] (SPARK-37072) Pass all UTs in `repl` with Java 17
[ https://issues.apache.org/jira/browse/SPARK-37072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432964#comment-17432964 ] Apache Spark commented on SPARK-37072: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/34368 > Pass all UTs in `repl` with Java 17 > --- > > Key: SPARK-37072 > URL: https://issues.apache.org/jira/browse/SPARK-37072 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Major > > Run `mvn clean install -pl repl` with Java 17 > {code:java} > Run completed in 30 seconds, 826 milliseconds. > Total number of tests run: 42 > Suites: completed 6, aborted 0 > Tests: succeeded 33, failed 9, canceled 0, ignored 0, pending 0 > *** 9 TESTS FAILED *** > {code} > The test failed as similar reasons: > {code:java} > - broadcast vars *** FAILED *** > isContain was true Interpreter output contained 'Exception': > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 3.3.0-SNAPSHOT > /_/ > > Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 17) > Type in expressions to have them evaluated. > Type :help for more information. 
> > scala> > scala> array: Array[Int] = Array(0, 0, 0, 0, 0) > > scala> broadcastArray: org.apache.spark.broadcast.Broadcast[Array[Int]] = > Broadcast(0) > > scala> java.lang.IllegalAccessException: Can not set final $iw field > $Lambda$2879/0x00080188b928.arg$1 to $iw > at > java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:76) > at > java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:80) > at > java.base/jdk.internal.reflect.UnsafeQualifiedObjectFieldAccessorImpl.set(UnsafeQualifiedObjectFieldAccessorImpl.java:79) > at java.base/java.lang.reflect.Field.set(Field.java:799) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:398) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2490) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:414) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:406) > at org.apache.spark.rdd.RDD.map(RDD.scala:413) > ... 
95 elided > > scala> > scala> java.lang.IllegalAccessException: Can not set final $iw field > $Lambda$2907/0x0008019536f8.arg$1 to $iw > at > java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:76) > at > java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:80) > at > java.base/jdk.internal.reflect.UnsafeQualifiedObjectFieldAccessorImpl.set(UnsafeQualifiedObjectFieldAccessorImpl.java:79) > at java.base/java.lang.reflect.Field.set(Field.java:799) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:398) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2490) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:414) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:406) > at org.apache.spark.rdd.RDD.map(RDD.scala:413) > ... 95 elided > > scala> | > scala> :quit (ReplSuite.scala:83) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37072) Pass all UTs in `repl` with Java 17
[ https://issues.apache.org/jira/browse/SPARK-37072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37072: Assignee: (was: Apache Spark) > Pass all UTs in `repl` with Java 17 > --- > > Key: SPARK-37072 > URL: https://issues.apache.org/jira/browse/SPARK-37072 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Major > > Run `mvn clean install -pl repl` with Java 17 > {code:java} > Run completed in 30 seconds, 826 milliseconds. > Total number of tests run: 42 > Suites: completed 6, aborted 0 > Tests: succeeded 33, failed 9, canceled 0, ignored 0, pending 0 > *** 9 TESTS FAILED *** > {code} > The test failed as similar reasons: > {code:java} > - broadcast vars *** FAILED *** > isContain was true Interpreter output contained 'Exception': > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 3.3.0-SNAPSHOT > /_/ > > Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 17) > Type in expressions to have them evaluated. > Type :help for more information. 
> > scala> > scala> array: Array[Int] = Array(0, 0, 0, 0, 0) > > scala> broadcastArray: org.apache.spark.broadcast.Broadcast[Array[Int]] = > Broadcast(0) > > scala> java.lang.IllegalAccessException: Can not set final $iw field > $Lambda$2879/0x00080188b928.arg$1 to $iw > at > java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:76) > at > java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:80) > at > java.base/jdk.internal.reflect.UnsafeQualifiedObjectFieldAccessorImpl.set(UnsafeQualifiedObjectFieldAccessorImpl.java:79) > at java.base/java.lang.reflect.Field.set(Field.java:799) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:398) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2490) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:414) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:406) > at org.apache.spark.rdd.RDD.map(RDD.scala:413) > ... 
95 elided > > scala> > scala> java.lang.IllegalAccessException: Can not set final $iw field > $Lambda$2907/0x0008019536f8.arg$1 to $iw > at > java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:76) > at > java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:80) > at > java.base/jdk.internal.reflect.UnsafeQualifiedObjectFieldAccessorImpl.set(UnsafeQualifiedObjectFieldAccessorImpl.java:79) > at java.base/java.lang.reflect.Field.set(Field.java:799) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:398) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2490) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:414) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:406) > at org.apache.spark.rdd.RDD.map(RDD.scala:413) > ... 95 elided > > scala> | > scala> :quit (ReplSuite.scala:83) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37072) Pass all UTs in `repl` with Java 17
[ https://issues.apache.org/jira/browse/SPARK-37072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432963#comment-17432963 ] Apache Spark commented on SPARK-37072: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/34368 > Pass all UTs in `repl` with Java 17 > --- > > Key: SPARK-37072 > URL: https://issues.apache.org/jira/browse/SPARK-37072 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Major > > Run `mvn clean install -pl repl` with Java 17 > {code:java} > Run completed in 30 seconds, 826 milliseconds. > Total number of tests run: 42 > Suites: completed 6, aborted 0 > Tests: succeeded 33, failed 9, canceled 0, ignored 0, pending 0 > *** 9 TESTS FAILED *** > {code} > The test failed as similar reasons: > {code:java} > - broadcast vars *** FAILED *** > isContain was true Interpreter output contained 'Exception': > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 3.3.0-SNAPSHOT > /_/ > > Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 17) > Type in expressions to have them evaluated. > Type :help for more information. 
> > scala> > scala> array: Array[Int] = Array(0, 0, 0, 0, 0) > > scala> broadcastArray: org.apache.spark.broadcast.Broadcast[Array[Int]] = > Broadcast(0) > > scala> java.lang.IllegalAccessException: Can not set final $iw field > $Lambda$2879/0x00080188b928.arg$1 to $iw > at > java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:76) > at > java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:80) > at > java.base/jdk.internal.reflect.UnsafeQualifiedObjectFieldAccessorImpl.set(UnsafeQualifiedObjectFieldAccessorImpl.java:79) > at java.base/java.lang.reflect.Field.set(Field.java:799) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:398) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2490) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:414) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:406) > at org.apache.spark.rdd.RDD.map(RDD.scala:413) > ... 
95 elided > > scala> > scala> java.lang.IllegalAccessException: Can not set final $iw field > $Lambda$2907/0x0008019536f8.arg$1 to $iw > at > java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:76) > at > java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:80) > at > java.base/jdk.internal.reflect.UnsafeQualifiedObjectFieldAccessorImpl.set(UnsafeQualifiedObjectFieldAccessorImpl.java:79) > at java.base/java.lang.reflect.Field.set(Field.java:799) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:398) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2490) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:414) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:406) > at org.apache.spark.rdd.RDD.map(RDD.scala:413) > ... 95 elided > > scala> | > scala> :quit (ReplSuite.scala:83) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37072) Pass all UTs in `repl` with Java 17
[ https://issues.apache.org/jira/browse/SPARK-37072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37072: Assignee: Apache Spark > Pass all UTs in `repl` with Java 17 > --- > > Key: SPARK-37072 > URL: https://issues.apache.org/jira/browse/SPARK-37072 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > > Run `mvn clean install -pl repl` with Java 17 > {code:java} > Run completed in 30 seconds, 826 milliseconds. > Total number of tests run: 42 > Suites: completed 6, aborted 0 > Tests: succeeded 33, failed 9, canceled 0, ignored 0, pending 0 > *** 9 TESTS FAILED *** > {code} > The test failed as similar reasons: > {code:java} > - broadcast vars *** FAILED *** > isContain was true Interpreter output contained 'Exception': > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 3.3.0-SNAPSHOT > /_/ > > Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 17) > Type in expressions to have them evaluated. > Type :help for more information. 
> > scala> > scala> array: Array[Int] = Array(0, 0, 0, 0, 0) > > scala> broadcastArray: org.apache.spark.broadcast.Broadcast[Array[Int]] = > Broadcast(0) > > scala> java.lang.IllegalAccessException: Can not set final $iw field > $Lambda$2879/0x00080188b928.arg$1 to $iw > at > java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:76) > at > java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:80) > at > java.base/jdk.internal.reflect.UnsafeQualifiedObjectFieldAccessorImpl.set(UnsafeQualifiedObjectFieldAccessorImpl.java:79) > at java.base/java.lang.reflect.Field.set(Field.java:799) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:398) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2490) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:414) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:406) > at org.apache.spark.rdd.RDD.map(RDD.scala:413) > ... 
95 elided > > scala> > scala> java.lang.IllegalAccessException: Can not set final $iw field > $Lambda$2907/0x0008019536f8.arg$1 to $iw > at > java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:76) > at > java.base/jdk.internal.reflect.UnsafeFieldAccessorImpl.throwFinalFieldIllegalAccessException(UnsafeFieldAccessorImpl.java:80) > at > java.base/jdk.internal.reflect.UnsafeQualifiedObjectFieldAccessorImpl.set(UnsafeQualifiedObjectFieldAccessorImpl.java:79) > at java.base/java.lang.reflect.Field.set(Field.java:799) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:398) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2490) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:414) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:406) > at org.apache.spark.rdd.RDD.map(RDD.scala:413) > ... 95 elided > > scala> | > scala> :quit (ReplSuite.scala:83) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37006) MapStatus adds localDirs to avoid the rpc request by method getHostLocalDirs when shuffle reading
[ https://issues.apache.org/jira/browse/SPARK-37006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432914#comment-17432914 ] jinhai commented on SPARK-37006: hi [~Ngone51], can you review this issue for me? > MapStatus adds localDirs to avoid the rpc request by method getHostLocalDirs > when shuffle reading > - > > Key: SPARK-37006 > URL: https://issues.apache.org/jira/browse/SPARK-37006 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.1.2 >Reporter: jinhai >Priority: Major > > When executing the ShuffleBlockFetcherIterator.fetchHostLocalBlocks method, > in order to obtain the hostLocalDirs value, we need to send an RPC request > through ExternalBlockStoreClient or NettyBlockTransferService. Then get > shuffle data according to blockId and localDirs. > We can add localDir to the BlockManagerId class of MapStatus, so that we can > get localDir directly when fetch host-local blocks without sending RPC > requests. > The benefits are: > 1. No need to send RPC request localDirs value when fetchHostLocalBlocks; > 2. When the external shuffle service is enabled, there is no need to register > ExecutorShuffleInfo in the ExternalShuffleBlockResolver class, nor to save > the ExecutorShuffleInfo data in the ExternalShuffleBlockResolver class > through leveldb. > 3. Also, there is no need to cache host-local dirs in the HostLocalDirManager > class. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
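The key property the proposal relies on is that a shuffle file's location is a pure function of the executor's local directories and the file name, in the style of `DiskBlockManager.getFile`. The sketch below is hypothetical Python, not Spark code; the hash function and the 64-subdirectory default are stand-ins to show why a reader that already knows `localDirs` needs no RPC round trip:

```python
SUB_DIRS_PER_LOCAL_DIR = 64  # stand-in for spark.diskStore.subDirectories

def non_negative_hash(s: str) -> int:
    # Java-style 32-bit String.hashCode, folded to a non-negative value.
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h if h < 0x80000000 else 0x100000000 - h

def get_file(local_dirs, filename):
    # Deterministic lookup: hash picks the root dir, then the subdirectory.
    h = non_negative_hash(filename)
    root = local_dirs[h % len(local_dirs)]
    sub = (h // len(local_dirs)) % SUB_DIRS_PER_LOCAL_DIR
    return f"{root}/{sub:02x}/{filename}"

dirs = ["/tmp/blockmgr-a", "/tmp/blockmgr-b"]
# Any two readers with the same localDirs agree without coordination.
assert get_file(dirs, "shuffle_0_1_0.data") == get_file(dirs, "shuffle_0_1_0.data")
```

If `localDirs` travels inside the `MapStatus` (or can be rederived from appId and execId, as the follow-up comment suggests), the host-local fetch path can call a function like this directly instead of asking the shuffle service.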
[jira] [Issue Comment Deleted] (SPARK-37006) MapStatus adds localDirs to avoid the rpc request by method getHostLocalDirs when shuffle reading
[ https://issues.apache.org/jira/browse/SPARK-37006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jinhai updated SPARK-37006: --- Comment: was deleted (was: hi [~Ngone51], can you review this issue for me?) > MapStatus adds localDirs to avoid the rpc request by method getHostLocalDirs > when shuffle reading > - > > Key: SPARK-37006 > URL: https://issues.apache.org/jira/browse/SPARK-37006 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.1.2 >Reporter: jinhai >Priority: Major > > When executing the ShuffleBlockFetcherIterator.fetchHostLocalBlocks method, > in order to obtain the hostLocalDirs value, we need to send an RPC request > through ExternalBlockStoreClient or NettyBlockTransferService. Then get > shuffle data according to blockId and localDirs. > We can add localDir to the BlockManagerId class of MapStatus, so that we can > get localDir directly when fetch host-local blocks without sending RPC > requests. > The benefits are: > 1. No need to send RPC request localDirs value when fetchHostLocalBlocks; > 2. When the external shuffle service is enabled, there is no need to register > ExecutorShuffleInfo in the ExternalShuffleBlockResolver class, nor to save > the ExecutorShuffleInfo data in the ExternalShuffleBlockResolver class > through leveldb. > 3. Also, there is no need to cache host-local dirs in the HostLocalDirManager > class. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-37006) MapStatus adds localDirs to avoid the rpc request by method getHostLocalDirs when shuffle reading
[ https://issues.apache.org/jira/browse/SPARK-37006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429079#comment-17429079 ] jinhai edited comment on SPARK-37006 at 10/22/21, 11:01 AM: Or whether we can generate localDirs based on appId and execId, just like DiskBlockManager.getFile, so that we don't need to save localDirs in MapStatus, just add appId to MapStatus was (Author: csbliss): Or whether we can generate localDirs based on appId and execId, just like DiskBlockManager.getFile, so that we don't need to save localDirs in MapStatus, just add appId. > MapStatus adds localDirs to avoid the rpc request by method getHostLocalDirs > when shuffle reading > - > > Key: SPARK-37006 > URL: https://issues.apache.org/jira/browse/SPARK-37006 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.1.2 >Reporter: jinhai >Priority: Major > > When executing the ShuffleBlockFetcherIterator.fetchHostLocalBlocks method, > in order to obtain the hostLocalDirs value, we need to send an RPC request > through ExternalBlockStoreClient or NettyBlockTransferService. Then get > shuffle data according to blockId and localDirs. > We can add localDir to the BlockManagerId class of MapStatus, so that we can > get localDir directly when fetch host-local blocks without sending RPC > requests. > The benefits are: > 1. No need to send RPC request localDirs value when fetchHostLocalBlocks; > 2. When the external shuffle service is enabled, there is no need to register > ExecutorShuffleInfo in the ExternalShuffleBlockResolver class, nor to save > the ExecutorShuffleInfo data in the ExternalShuffleBlockResolver class > through leveldb. > 3. Also, there is no need to cache host-local dirs in the HostLocalDirManager > class. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37006) MapStatus adds localDirs to avoid the rpc request by method getHostLocalDirs when shuffle reading
[ https://issues.apache.org/jira/browse/SPARK-37006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jinhai updated SPARK-37006: --- Description: When executing the ShuffleBlockFetcherIterator.fetchHostLocalBlocks method, in order to obtain the hostLocalDirs value, we need to send an RPC request through ExternalBlockStoreClient or NettyBlockTransferService. Then get shuffle data according to blockId and localDirs. We can add localDir to the BlockManagerId class of MapStatus, so that we can get localDir directly when fetch host-local blocks without sending RPC requests. The benefits are: 1. No need to send RPC request localDirs value when fetchHostLocalBlocks; 2. When the external shuffle service is enabled, there is no need to register ExecutorShuffleInfo in the ExternalShuffleBlockResolver class, nor to save the ExecutorShuffleInfo data in the ExternalShuffleBlockResolver class through leveldb. 3. Also, there is no need to cache host-local dirs in the HostLocalDirManager class. was: In shuffle reading, in order to get the hostLocalDirs value when executing fetchHostLocalBlocks, we need ExternalBlockStoreClient or NettyBlockTransferService to make a rpc request. And when externalShuffleServiceEnabled, there is no need to registerExecutor and so on in the ExternalShuffleBlockResolver class. Throughout the spark shuffle module, a lot of code logic is written to deal with localDirs. We can directly add localDirs to the BlockManagerId class of MapStatus to get datafile and indexfile. 
> MapStatus adds localDirs to avoid the rpc request by method getHostLocalDirs > when shuffle reading > - > > Key: SPARK-37006 > URL: https://issues.apache.org/jira/browse/SPARK-37006 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.1.2 >Reporter: jinhai >Priority: Major > > When executing the ShuffleBlockFetcherIterator.fetchHostLocalBlocks method, > in order to obtain the hostLocalDirs value, we need to send an RPC request > through ExternalBlockStoreClient or NettyBlockTransferService. Then get > shuffle data according to blockId and localDirs. > We can add localDir to the BlockManagerId class of MapStatus, so that we can > get localDir directly when fetch host-local blocks without sending RPC > requests. > The benefits are: > 1. No need to send RPC request localDirs value when fetchHostLocalBlocks; > 2. When the external shuffle service is enabled, there is no need to register > ExecutorShuffleInfo in the ExternalShuffleBlockResolver class, nor to save > the ExecutorShuffleInfo data in the ExternalShuffleBlockResolver class > through leveldb. > 3. Also, there is no need to cache host-local dirs in the HostLocalDirManager > class. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37099) Impl a rank-based filter to optimize top-k computation
[ https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37099: Assignee: (was: Apache Spark) > Impl a rank-based filter to optimize top-k computation > -- > > Key: SPARK-37099 > URL: https://issues.apache.org/jira/browse/SPARK-37099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: zhengruifeng >Priority: Major > Attachments: skewed_window.png > > > in JD, we found that more than 80% usage of window function follows this > pattern: > {code:java} > select (... row_number() over(partition by ... order by ...) as rn) > where rn ==[\<=] k{code} > > However, existing physical plan is not optimum: > > 1, we should select local top-k records within each partitions, and then > compute the global top-k. this can help reduce the shuffle amount; > > 2, skewed-window: some partition is skewed and take a long time to finish > computation. > > A real-world skewed-window case in our system is attached. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
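The two-phase evaluation proposed above (local top-k per partition key before the shuffle, then a global merge) can be sketched with plain heaps. This is an illustrative model, not the Spark physical plan: each "chunk" stands in for a map-side partition, and rows are `(window_key, order_value)` pairs filtered to `rn <= k` under an ascending order:

```python
import heapq
from collections import defaultdict

def local_top_k(rows, k):
    # Keep only the k smallest order_values per window key, using a
    # bounded max-heap (values negated) so memory stays O(keys * k).
    per_key = defaultdict(list)
    for key, value in rows:
        heapq.heappush(per_key[key], -value)
        if len(per_key[key]) > k:
            heapq.heappop(per_key[key])  # evict the current largest
    return [(key, -v) for key, heap in per_key.items() for v in heap]

def global_top_k(chunks, k):
    # Phase 1: prune each chunk locally (shrinks the shuffle to <= k rows
    # per key per chunk). Phase 2: merge the survivors and prune again.
    merged = [row for chunk in chunks for row in local_top_k(chunk, k)]
    return sorted(local_top_k(merged, k))

chunk_a = [("u1", 5), ("u1", 1), ("u2", 9)]
chunk_b = [("u1", 3), ("u2", 2), ("u2", 7)]
print(global_top_k([chunk_a, chunk_b], 2))
# [('u1', 1), ('u1', 3), ('u2', 2), ('u2', 7)]
```

Only the local winners cross the shuffle boundary, which addresses both pain points in the ticket: less data shuffled, and a skewed key contributes at most k rows per upstream chunk to the hot reduce-side partition.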
[jira] [Commented] (SPARK-37099) Impl a rank-based filter to optimize top-k computation
[ https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432910#comment-17432910 ] Apache Spark commented on SPARK-37099: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/34367 > Impl a rank-based filter to optimize top-k computation > -- > > Key: SPARK-37099 > URL: https://issues.apache.org/jira/browse/SPARK-37099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: zhengruifeng >Priority: Major > Attachments: skewed_window.png > > > In JD, we found that more than 80% of window function usage follows this > pattern: > {code:java} > select (... row_number() over(partition by ... order by ...) as rn) > where rn ==[\<=] k{code} > > However, the existing physical plan is not optimal: > > 1, we should select the local top-k records within each partition, and then > compute the global top-k. This helps reduce the amount of shuffle data; > > 2, skewed-window: some partitions are skewed and take a long time to finish > computation. > > A real-world skewed-window case in our system is attached. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37099) Impl a rank-based filter to optimize top-k computation
[ https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37099: Assignee: Apache Spark > Impl a rank-based filter to optimize top-k computation > -- > > Key: SPARK-37099 > URL: https://issues.apache.org/jira/browse/SPARK-37099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Major > Attachments: skewed_window.png > > > In JD, we found that more than 80% of window function usage follows this > pattern: > {code:java} > select (... row_number() over(partition by ... order by ...) as rn) > where rn ==[\<=] k{code} > > However, the existing physical plan is not optimal: > > 1, we should select the local top-k records within each partition, and then > compute the global top-k. This helps reduce the amount of shuffle data; > > 2, skewed-window: some partitions are skewed and take a long time to finish > computation. > > A real-world skewed-window case in our system is attached. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37099) Impl a rank-based filter to optimize top-k computation
[ https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-37099: - Description: In JD, we found that more than 80% of window function usage follows this pattern: select (... row_number() over(partition by ... order by ...) as rn) where rn ==[\<=] k However, the existing physical plan is not optimal: 1, we should select the local top-k records within each partition, and then compute the global top-k. This helps reduce the amount of shuffle data; 2, skewed-window: some partitions are skewed and take a long time to finish computation. A real-world skewed-window case in our system is attached. was: In JD, we found that more than 80% of window function usage follows this pattern: select (... row_number() over(partition by ... order by ...) as rn) where rn ==[\<=] k However, the existing physical plan is not optimal: 1, we should select the local top-k records within each partition, and then compute the global top-k. This helps reduce the amount of shuffle data; 2, skewed-window: some partitions are skewed and take a long time to finish computation. A real-world skewed-window case in our system is attached. > Impl a rank-based filter to optimize top-k computation > -- > > Key: SPARK-37099 > URL: https://issues.apache.org/jira/browse/SPARK-37099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: zhengruifeng >Priority: Major > Attachments: skewed_window.png > > > In JD, we found that more than 80% of window function usage follows this > pattern: > {code:java} > select (... row_number() over(partition by ... order by ...) as rn) > where rn ==[\<=] k > > However, the existing physical plan is not optimal: > > 1, we should select the local top-k records within each partition, and then > compute the global top-k. This helps reduce the amount of shuffle data; > > 2, skewed-window: some partitions are skewed and take a long time to finish > computation. 
> > A real-world skewed-window case in our system is attached. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37099) Impl a rank-based filter to optimize top-k computation
[ https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-37099: - Attachment: skewed_window.png > Impl a rank-based filter to optimize top-k computation > -- > > Key: SPARK-37099 > URL: https://issues.apache.org/jira/browse/SPARK-37099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: zhengruifeng >Priority: Major > Attachments: skewed_window.png > > > In JD, we found that more than 80% of window function usage follows this > pattern: > > select (... row_number() over(partition by ... order by ...) as rn) > where rn ==[\<=] k > > However, the existing physical plan is not optimal: > > 1, we should select the local top-k records within each partition, and then > compute the global top-k. This helps reduce the amount of shuffle data; > > 2, skewed-window: some partitions are skewed and take a long time to finish > computation. > > A real-world skewed-window case in our system is attached. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37099) Impl a rank-based filter to optimize top-k computation
[ https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-37099: - Description: In JD, we found that more than 80% of window function usage follows this pattern: {code:java} select (... row_number() over(partition by ... order by ...) as rn) where rn ==[\<=] k{code} However, the existing physical plan is not optimal: 1, we should select the local top-k records within each partition, and then compute the global top-k. This helps reduce the amount of shuffle data; 2, skewed-window: some partitions are skewed and take a long time to finish computation. A real-world skewed-window case in our system is attached. was: In JD, we found that more than 80% of window function usage follows this pattern: select (... row_number() over(partition by ... order by ...) as rn) where rn ==[\<=] k However, the existing physical plan is not optimal: 1, we should select the local top-k records within each partition, and then compute the global top-k. This helps reduce the amount of shuffle data; 2, skewed-window: some partitions are skewed and take a long time to finish computation. A real-world skewed-window case in our system is attached. > Impl a rank-based filter to optimize top-k computation > -- > > Key: SPARK-37099 > URL: https://issues.apache.org/jira/browse/SPARK-37099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: zhengruifeng >Priority: Major > Attachments: skewed_window.png > > > In JD, we found that more than 80% of window function usage follows this > pattern: > {code:java} > select (... row_number() over(partition by ... order by ...) as rn) > where rn ==[\<=] k{code} > > However, the existing physical plan is not optimal: > > 1, we should select the local top-k records within each partition, and then > compute the global top-k. This helps reduce the amount of shuffle data; > > 2, skewed-window: some partitions are skewed and take a long time to finish > computation. 
> > A real-world skewed-window case in our system is attached. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37099) Impl a rank-based filter to optimize top-k computation
zhengruifeng created SPARK-37099: Summary: Impl a rank-based filter to optimize top-k computation Key: SPARK-37099 URL: https://issues.apache.org/jira/browse/SPARK-37099 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: zhengruifeng In JD, we found that more than 80% of window function usage follows this pattern: select (... row_number() over(partition by ... order by ...) as rn) where rn ==[\<=] k However, the existing physical plan is not optimal: 1, we should select the local top-k records within each partition, and then compute the global top-k. This helps reduce the amount of shuffle data; 2, skewed-window: some partitions are skewed and take a long time to finish computation. This is a real-world skewed-window case in our system: !image-2021-10-22-18-46-58-496.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37099) Impl a rank-based filter to optimize top-k computation
[ https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-37099: - Description: In JD, we found that more than 80% of window function usage follows this pattern: select (... row_number() over(partition by ... order by ...) as rn) where rn ==[\<=] k However, the existing physical plan is not optimal: 1, we should select the local top-k records within each partition, and then compute the global top-k. This helps reduce the amount of shuffle data; 2, skewed-window: some partitions are skewed and take a long time to finish computation. A real-world skewed-window case in our system is attached. was: In JD, we found that more than 80% of window function usage follows this pattern: select (... row_number() over(partition by ... order by ...) as rn) where rn ==[\<=] k However, the existing physical plan is not optimal: 1, we should select the local top-k records within each partition, and then compute the global top-k. This helps reduce the amount of shuffle data; 2, skewed-window: some partitions are skewed and take a long time to finish computation. This is a real-world skewed-window case in our system: !image-2021-10-22-18-46-58-496.png! > Impl a rank-based filter to optimize top-k computation > -- > > Key: SPARK-37099 > URL: https://issues.apache.org/jira/browse/SPARK-37099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: zhengruifeng >Priority: Major > > In JD, we found that more than 80% of window function usage follows this > pattern: > > select (... row_number() over(partition by ... order by ...) as rn) > where rn ==[\<=] k > > However, the existing physical plan is not optimal: > > 1, we should select the local top-k records within each partition, and then > compute the global top-k. This helps reduce the amount of shuffle data; > > 2, skewed-window: some partitions are skewed and take a long time to finish > computation. 
> > A real-world skewed-window case in our system is attached. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37016) Publicise UpperCaseCharStream
[ https://issues.apache.org/jira/browse/SPARK-37016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432902#comment-17432902 ] dohongdayi commented on SPARK-37016: Anyone care about this issue? > Publicise UpperCaseCharStream > - > > Key: SPARK-37016 > URL: https://issues.apache.org/jira/browse/SPARK-37016 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.3, 2.3.4, 2.4.8, 3.0.3, 3.1.1, 3.1.2, 3.2.0 >Reporter: dohongdayi >Priority: Major > > Many Spark extension projects are copying `UpperCaseCharStream` because it is > private beneath `parser` package, such as: > [Delta > Lake|https://github.com/delta-io/delta/blob/625de3b305f109441ad04b20dba91dd6c4e1d78e/core/src/main/scala/io/delta/sql/parser/DeltaSqlParser.scala#L290] > [Hudi|https://github.com/apache/hudi/blob/3f8ca1a3552bb866163d3b1648f68d9c4824e21d/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/parser/HoodieCommonSqlParser.scala#L112] > [Iceberg|https://github.com/apache/iceberg/blob/c3ac4c6ca74a0013b4705d5bd5d17fade8e6f499/spark3-extensions/src/main/scala/org/apache/spark/sql/catalyst/parser/extensions/IcebergSparkSqlExtensionsParser.scala#L175] > [Submarine|https://github.com/apache/submarine/blob/2faebb8efd69833853f62d89b4f1fea1b1718148/submarine-security/spark-security/src/main/scala/org/apache/submarine/spark/security/parser/UpperCaseCharStream.scala#L31] > [Kyuubi|https://github.com/apache/incubator-kyuubi/blob/8a5134e3223844714fc58833a6859d4df5b68d57/dev/kyuubi-extension-spark-common/src/main/scala/org/apache/kyuubi/sql/zorder/ZorderSparkSqlExtensionsParserBase.scala#L108] > [Spark-ACID|https://github.com/qubole/spark-acid/blob/19bd6db757677c40f448e85c74d9995ba97d5942/src/main/scala/com/qubole/spark/datasources/hiveacid/sql/catalyst/parser/ParseDriver.scala#L13] > We can publicise `UpperCaseCharStream` to eliminate code duplication. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37098) Alter table properties should invalidate cache
[ https://issues.apache.org/jira/browse/SPARK-37098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37098: Assignee: Apache Spark > Alter table properties should invalidate cache > -- > > Key: SPARK-37098 > URL: https://issues.apache.org/jira/browse/SPARK-37098 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major > > The table properties can change the behavior of writing, e.g. a parquet > table with `parquet.compression`. > If you execute the following SQL, we will get the file with snappy > compression rather than zstd. > {code:java} > CREATE TABLE t (c int) STORED AS PARQUET; > // cache table metadata > SELECT * FROM t; > ALTER TABLE t SET TBLPROPERTIES('parquet.compression'='zstd'); > INSERT INTO TABLE t values(1); > {code} > So we should invalidate the table cache after altering table properties. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
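The staleness described in the issue can be sketched without Spark. Below is an illustrative Python model (the `Catalog` class and its methods are made up for this sketch, not Spark's internal API): a session-level metadata cache that is not dropped on ALTER keeps serving the old compression codec to writers.

```python
class Catalog:
    def __init__(self):
        self.tables = {}  # authoritative table metadata
        self.cache = {}   # per-session cached metadata

    def create_table(self, name, props):
        self.tables[name] = dict(props)

    def lookup(self, name):
        # A SELECT-style access populates (and then keeps serving) the cache.
        if name not in self.cache:
            self.cache[name] = dict(self.tables[name])
        return self.cache[name]

    def alter_properties(self, name, props, invalidate=True):
        self.tables[name].update(props)
        if invalidate:
            self.cache.pop(name, None)  # the fix: drop the stale cached entry


catalog = Catalog()
catalog.create_table("t", {"parquet.compression": "snappy"})
catalog.lookup("t")  # SELECT * FROM t caches the metadata

# Without invalidation, a later write still sees the old codec (the bug):
catalog.alter_properties("t", {"parquet.compression": "zstd"}, invalidate=False)
stale = catalog.lookup("t")["parquet.compression"]

# With invalidation, the next lookup re-reads the catalog (the fix):
catalog.alter_properties("t", {"parquet.compression": "zstd"})
fresh = catalog.lookup("t")["parquet.compression"]
```

The `invalidate=False` branch reproduces the reported behavior (`stale` is still `"snappy"`), while the default path yields `"zstd"`, matching the proposed fix of invalidating the cache on ALTER TABLE ... SET TBLPROPERTIES.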
[jira] [Assigned] (SPARK-37098) Alter table properties should invalidate cache
[ https://issues.apache.org/jira/browse/SPARK-37098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37098: Assignee: (was: Apache Spark) > Alter table properties should invalidate cache > -- > > Key: SPARK-37098 > URL: https://issues.apache.org/jira/browse/SPARK-37098 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: XiDuo You >Priority: Major > > The table properties can change the behavior of writing, e.g. a parquet > table with `parquet.compression`. > If you execute the following SQL, we will get the file with snappy > compression rather than zstd. > {code:java} > CREATE TABLE t (c int) STORED AS PARQUET; > // cache table metadata > SELECT * FROM t; > ALTER TABLE t SET TBLPROPERTIES('parquet.compression'='zstd'); > INSERT INTO TABLE t values(1); > {code} > So we should invalidate the table cache after altering table properties. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37097) yarn-cluster mode, unregister timeout cause spark retry but AM container exit with code 0
[ https://issues.apache.org/jira/browse/SPARK-37097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37097: Assignee: Apache Spark > yarn-cluster mode, unregister timeout cause spark retry but AM container exit > with code 0 > - > > Key: SPARK-37097 > URL: https://issues.apache.org/jira/browse/SPARK-37097 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > > 1. In cluster mode, the AM shutdown hook is triggered. > 2. The AM's unregister from the RM times out, but the AM shutdown hook has a > try/catch, so the AM container exits with code 0. > 3. Since the RM loses the connection with the AM, it treats this container as > failed. > 4. The client side then gets an application report with final status FAILED > even though the AM container exited with code 0, and retries. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37097) yarn-cluster mode, unregister timeout cause spark retry but AM container exit with code 0
[ https://issues.apache.org/jira/browse/SPARK-37097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37097: Assignee: (was: Apache Spark) > yarn-cluster mode, unregister timeout cause spark retry but AM container exit > with code 0 > - > > Key: SPARK-37097 > URL: https://issues.apache.org/jira/browse/SPARK-37097 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > 1. In cluster mode, the AM shutdown hook is triggered. > 2. The AM's unregister from the RM times out, but the AM shutdown hook has a > try/catch, so the AM container exits with code 0. > 3. Since the RM loses the connection with the AM, it treats this container as > failed. > 4. The client side then gets an application report with final status FAILED > even though the AM container exited with code 0, and retries. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37097) yarn-cluster mode, unregister timeout cause spark retry but AM container exit with code 0
[ https://issues.apache.org/jira/browse/SPARK-37097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432892#comment-17432892 ] Apache Spark commented on SPARK-37097: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/34366 > yarn-cluster mode, unregister timeout cause spark retry but AM container exit > with code 0 > - > > Key: SPARK-37097 > URL: https://issues.apache.org/jira/browse/SPARK-37097 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > 1. In cluster mode, the AM shutdown hook is triggered. > 2. The AM's unregister from the RM times out, but the AM shutdown hook has a > try/catch, so the AM container exits with code 0. > 3. Since the RM loses the connection with the AM, it treats this container as > failed. > 4. The client side then gets an application report with final status FAILED > even though the AM container exited with code 0, and retries. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37098) Alter table properties should invalidate cache
[ https://issues.apache.org/jira/browse/SPARK-37098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432891#comment-17432891 ] Apache Spark commented on SPARK-37098: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/34365 > Alter table properties should invalidate cache > -- > > Key: SPARK-37098 > URL: https://issues.apache.org/jira/browse/SPARK-37098 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: XiDuo You >Priority: Major > > The table properties can change the behavior of writing, e.g. a parquet > table with `parquet.compression`. > If you execute the following SQL, we will get the file with snappy > compression rather than zstd. > {code:java} > CREATE TABLE t (c int) STORED AS PARQUET; > // cache table metadata > SELECT * FROM t; > ALTER TABLE t SET TBLPROPERTIES('parquet.compression'='zstd'); > INSERT INTO TABLE t values(1); > {code} > So we should invalidate the table cache after altering table properties. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37098) Alter table properties should invalidate cache
XiDuo You created SPARK-37098: - Summary: Alter table properties should invalidate cache Key: SPARK-37098 URL: https://issues.apache.org/jira/browse/SPARK-37098 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0, 3.1.2, 3.0.3, 3.3.0 Reporter: XiDuo You The table properties can change the behavior of writing, e.g. a parquet table with `parquet.compression`. If you execute the following SQL, we will get the file with snappy compression rather than zstd. {code:java} CREATE TABLE t (c int) STORED AS PARQUET; // cache table metadata SELECT * FROM t; ALTER TABLE t SET TBLPROPERTIES('parquet.compression'='zstd'); INSERT INTO TABLE t values(1); {code} So we should invalidate the table cache after altering table properties. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37097) yarn-cluster mode, unregister timeout cause spark retry but AM container exit with code 0
[ https://issues.apache.org/jira/browse/SPARK-37097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-37097: -- Description: 1. In cluster mode, the AM shutdown hook is triggered. 2. The AM's unregister from the RM times out, but the AM shutdown hook has a try/catch, so the AM container exits with code 0. 3. Since the RM loses the connection with the AM, it treats this container as failed. 4. The client side then gets an application report with final status FAILED even though the AM container exited with code 0, and retries. was: In cluster mode, the AM shutdown hook is triggered and the AM's unregister from the RM times out, but the AM shutdown hook has a try/catch, so the AM container exits with code 0. But since the RM loses the connection with the AM, it treats this container as failed. The client side then gets an application report with final status FAILED even though the AM container exited with code 0, and retries. > yarn-cluster mode, unregister timeout cause spark retry but AM container exit > with code 0 > - > > Key: SPARK-37097 > URL: https://issues.apache.org/jira/browse/SPARK-37097 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > 1. In cluster mode, the AM shutdown hook is triggered. > 2. The AM's unregister from the RM times out, but the AM shutdown hook has a > try/catch, so the AM container exits with code 0. > 3. Since the RM loses the connection with the AM, it treats this container as > failed. > 4. The client side then gets an application report with final status FAILED > even though the AM container exited with code 0, and retries. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
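The four steps above boil down to a shutdown hook whose broad try/catch makes the process-level exit code disagree with what the resource manager observed. Here is an illustrative Python reduction of that failure mode, not Spark or YARN code; `unregister_from_rm` and `shutdown_hook` are hypothetical stand-ins.

```python
def unregister_from_rm(timeout_expired):
    """Stand-in for the AM's unregister RPC to the ResourceManager."""
    if timeout_expired:
        raise TimeoutError("unregister RPC to RM timed out")
    return "UNREGISTERED"


def shutdown_hook(timeout_expired, swallow_errors=True):
    """Returns the exit code the AM container would report."""
    try:
        unregister_from_rm(timeout_expired)
        return 0
    except TimeoutError:
        if swallow_errors:
            # The confusing case from the report: the unregister failed,
            # the RM marks the attempt failed, yet the container exits 0.
            return 0
        # Propagating the failure keeps the exit code consistent with the
        # RM's view, so the client's retry decision at least matches.
        return 1


ok = shutdown_hook(timeout_expired=False)                          # 0
buggy = shutdown_hook(timeout_expired=True)                        # 0, despite failure
fixed = shutdown_hook(timeout_expired=True, swallow_errors=False)  # 1
```

The `buggy` path is exactly the symptom in the title: the client sees final status FAILED from the RM and retries, while the container log shows exit code 0.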
[jira] [Resolved] (SPARK-37073) Pass all UTs in `external/avro` with Java 17
[ https://issues.apache.org/jira/browse/SPARK-37073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37073. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34364 [https://github.com/apache/spark/pull/34364] > Pass all UTs in `external/avro` with Java 17 > > > Key: SPARK-37073 > URL: https://issues.apache.org/jira/browse/SPARK-37073 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.3.0 > > > Run `mvn clean install -pl external/avro` with Java 17 > > > {code:java} > Run completed in 43 seconds, 988 milliseconds. > Total number of tests run: 283 > Suites: completed 14, aborted 0 > Tests: succeeded 281, failed 2, canceled 0, ignored 2, pending 0 > *** 2 TESTS FAILED *** > {code} > > {code:java} > - support user provided non-nullable avro schema for nullable catalyst schema > without any null record *** FAILED *** > "Job aborted due to stage failure: Task 1 in stage 144.0 failed 1 times, > most recent failure: Lost task 1.0 in stage 144.0 (TID 250) (localhost > executor driver): org.apache.spark.SparkException: Task failed while writing > rows. 
> at > org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:516) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:345) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$20(FileFormatWriter.scala:252) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:136) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) > at java.base/java.lang.Thread.run(Thread.java:833) > Caused by: org.apache.avro.file.DataFileWriter$AppendWriteException: > java.lang.NullPointerException: Cannot invoke "Object.getClass()" because > "datum" is null of string in string in field Name of test_schema in > test_schema > at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:317) > at > org.apache.spark.sql.avro.SparkAvroKeyRecordWriter.write(SparkAvroKeyOutputFormat.java:84) > at > org.apache.spark.sql.avro.SparkAvroKeyRecordWriter.write(SparkAvroKeyOutputFormat.java:62) > at > org.apache.spark.sql.avro.AvroOutputWriter.write(AvroOutputWriter.scala:84) > at > org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:175) > at > org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithMetrics(FileFormatDataWriter.scala:85) > at > org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithIterator(FileFormatDataWriter.scala:92) > at > 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:328) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1502) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:335) > ... 9 more > Caused by: java.lang.NullPointerException: Cannot invoke > "Object.getClass()" because "datum" is null of string in string in field Name > of test_schema in test_schema > at > org.apache.avro.generic.GenericDatumWriter.npe(GenericDatumWriter.java:184) > at > org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:160) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:73) > at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:314) > ... 18 more > Caused by: java.lang.NullPointerException: Cannot invoke > "Object.getClass()" because "datum" is null > at > org.apache.avro.specific.SpecificDatumWriter.writeString(SpecificDatumWriter.java:68) > at > org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:151) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:83) > at > org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:158) > at > org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:221) > at >
[jira] [Assigned] (SPARK-37073) Pass all UTs in `external/avro` with Java 17
[ https://issues.apache.org/jira/browse/SPARK-37073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-37073: Assignee: Yang Jie > Pass all UTs in `external/avro` with Java 17 > > > Key: SPARK-37073 > URL: https://issues.apache.org/jira/browse/SPARK-37073 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > Run `mvn clean install -pl external/avro` with Java 17 > > > {code:java} > Run completed in 43 seconds, 988 milliseconds. > Total number of tests run: 283 > Suites: completed 14, aborted 0 > Tests: succeeded 281, failed 2, canceled 0, ignored 2, pending 0 > *** 2 TESTS FAILED *** > {code} > > {code:java} > - support user provided non-nullable avro schema for nullable catalyst schema > without any null record *** FAILED *** > "Job aborted due to stage failure: Task 1 in stage 144.0 failed 1 times, > most recent failure: Lost task 1.0 in stage 144.0 (TID 250) (localhost > executor driver): org.apache.spark.SparkException: Task failed while writing > rows. 
> at > org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:516) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:345) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$20(FileFormatWriter.scala:252) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:136) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) > at java.base/java.lang.Thread.run(Thread.java:833) > Caused by: org.apache.avro.file.DataFileWriter$AppendWriteException: > java.lang.NullPointerException: Cannot invoke "Object.getClass()" because > "datum" is null of string in string in field Name of test_schema in > test_schema > at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:317) > at > org.apache.spark.sql.avro.SparkAvroKeyRecordWriter.write(SparkAvroKeyOutputFormat.java:84) > at > org.apache.spark.sql.avro.SparkAvroKeyRecordWriter.write(SparkAvroKeyOutputFormat.java:62) > at > org.apache.spark.sql.avro.AvroOutputWriter.write(AvroOutputWriter.scala:84) > at > org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:175) > at > org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithMetrics(FileFormatDataWriter.scala:85) > at > org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithIterator(FileFormatDataWriter.scala:92) > at > 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:328) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1502) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:335) > ... 9 more > Caused by: java.lang.NullPointerException: Cannot invoke > "Object.getClass()" because "datum" is null of string in string in field Name > of test_schema in test_schema > at > org.apache.avro.generic.GenericDatumWriter.npe(GenericDatumWriter.java:184) > at > org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:160) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:73) > at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:314) > ... 18 more > Caused by: java.lang.NullPointerException: Cannot invoke > "Object.getClass()" because "datum" is null > at > org.apache.avro.specific.SpecificDatumWriter.writeString(SpecificDatumWriter.java:68) > at > org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:151) > at > org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:83) > at > org.apache.avro.reflect.ReflectDatumWriter.write(ReflectDatumWriter.java:158) > at > org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:221) > at > org.apache.avro.specific.SpecificDatumWriter.writeField(SpecificDatumWriter.java:101) > at >
[jira] [Created] (SPARK-37097) yarn-cluster mode, unregister timeout cause spark retry but AM container exit with code 0
angerszhu created SPARK-37097: - Summary: yarn-cluster mode, unregister timeout cause spark retry but AM container exit with code 0 Key: SPARK-37097 URL: https://issues.apache.org/jira/browse/SPARK-37097 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu In cluster mode, when the AM shutdown hook is triggered and the unregister call to the RM times out, the exception is swallowed by the try/catch inside the shutdown hook, so the AM container still exits with code 0. Because the RM has lost the connection to the AM, it treats the container as failed. The client side then receives an application report with final status FAILED even though the AM container exited with code 0, and Spark retries the application. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
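The mismatch SPARK-37097 describes can be sketched in a few lines: the shutdown hook swallows the unregister timeout, so the process exit code is 0 even though the RM never saw a clean unregister. This is an illustrative toy model (hypothetical `FakeRM` and `am_shutdown_hook`, not Spark or YARN code):

```python
# Toy model of the reported race: the AM's shutdown hook catches the
# unregister timeout, so the container exits 0, but the RM never records
# a successful unregister and marks the attempt failed.

class FakeRM:
    def __init__(self):
        self.unregistered = False

    def unregister(self, timeout_s):
        # Simulate the RM connection timing out before unregister lands.
        raise TimeoutError(f"no response from RM within {timeout_s}s")

def am_shutdown_hook(rm):
    """Return the AM container exit code."""
    try:
        rm.unregister(timeout_s=10)
        rm.unregistered = True
    except Exception:
        # Swallowed, as in the reported shutdown hook: exit code stays 0.
        pass
    return 0

rm = FakeRM()
exit_code = am_shutdown_hook(rm)
# exit_code is 0 while rm.unregistered is False: the client sees a clean
# container exit, but the RM reports the attempt as failed and Spark retries.
```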
[jira] [Updated] (SPARK-37096) Where clause and where operator will report error on varchar column type
[ https://issues.apache.org/jira/browse/SPARK-37096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ye Li updated SPARK-37096: -- Description: create table test1(col1 int, col2 varchar(120)) stored as orc; insert into test1 values(123, 'abc'); insert into test1 values(1234, 'abcd'); sparkSession.sql('select * from test1') is OK, but sparkSession.sql('select * from test1 where col2 = "abc"') or sparkSession.sql('select * from test1').where('col2 = "abc"') reports the error: java.lang.UnsupportedOperationException: DataType: varchar(120) was: create table test1(col1 int, col2 varchar(120)) stored as orc; insert into test1 values(123, 'abc'); insert into test1 values(1234, 'abcd'); sparkSession.sql('select * from bdctemp.liye_test202110212') is OK, but sparkSession.sql('select * from bdctemp.liye_test202110212 where col2 = "abc"') or sparkSession.sql('select * from bdctemp.liye_test202110212').where('col2 = "abc"') reports the error: java.lang.UnsupportedOperationException: DataType: varchar(120) Environment: HDP3.1.4 Priority: Critical (was: Major) > Where clause and where operator will report error on varchar column type > > > Key: SPARK-37096 > URL: https://issues.apache.org/jira/browse/SPARK-37096 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.1, 3.1.2 > Environment: HDP3.1.4 >Reporter: Ye Li >Priority: Critical > > create table test1(col1 int, col2 varchar(120)) stored as orc; > insert into test1 values(123, 'abc'); > insert into test1 values(1234, 'abcd'); > > sparkSession.sql('select * from test1') > is OK, but > sparkSession.sql('select * from test1 where col2 = "abc"') > or > sparkSession.sql('select * from test1').where('col2 = "abc"') > reports the error: > java.lang.UnsupportedOperationException: DataType: varchar(120) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
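The error shape in SPARK-37096 (a type-dispatch path that handles common types but has no branch for `varchar(n)`) can be illustrated with a toy model. The sketch below is plain Python, not Spark internals; `eval_predicate` and `normalize` are hypothetical names, and the normalization step only loosely mirrors Spark's internal replacement of CHAR/VARCHAR with StringType plus metadata:

```python
import re

# Toy model of the failure mode: dispatch on the declared column type has
# branches for "int" and "string" but none for "varchar(120)", so filtering
# on that column raises, matching the reported error shape.

def eval_predicate(col_type, value, literal):
    if col_type == "int":
        return value == int(literal)
    if col_type == "string":
        return value == literal
    raise NotImplementedError(f"DataType: {col_type}")  # the reported error shape

def normalize(col_type):
    """Map varchar(n)/char(n) to plain string before dispatch."""
    return "string" if re.fullmatch(r"(var)?char\(\d+\)", col_type) else col_type

# Raw varchar type raises, as in the report.
try:
    eval_predicate("varchar(120)", "abc", "abc")
    raised = False
except NotImplementedError:
    raised = True

# After normalizing the declared type, the same comparison works.
matched = eval_predicate(normalize("varchar(120)"), "abc", "abc")
```

A practical workaround consistent with this sketch would be comparing on a cast column, e.g. `where cast(col2 as string) = "abc"`, though whether that avoids the bug on the affected versions is an assumption, not confirmed by the report.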
[jira] [Created] (SPARK-37096) Where clause and where operator will report error on varchar column type
Ye Li created SPARK-37096: - Summary: Where clause and where operator will report error on varchar column type Key: SPARK-37096 URL: https://issues.apache.org/jira/browse/SPARK-37096 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 3.1.2, 3.1.1 Reporter: Ye Li create table test1(col1 int, col2 varchar(120)) stored as orc; insert into test1 values(123, 'abc'); insert into test1 values(1234, 'abcd'); sparkSession.sql('select * from bdctemp.liye_test202110212') is OK, but sparkSession.sql('select * from bdctemp.liye_test202110212 where col2 = "abc"') or sparkSession.sql('select * from bdctemp.liye_test202110212').where('col2 = "abc"') reports the error: java.lang.UnsupportedOperationException: DataType: varchar(120) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org