[jira] [Resolved] (SPARK-48138) Disable a flaky `SparkSessionE2ESuite.interrupt tag` test
[ https://issues.apache.org/jira/browse/SPARK-48138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-48138. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46396 [https://github.com/apache/spark/pull/46396] > Disable a flaky `SparkSessionE2ESuite.interrupt tag` test > - > > Key: SPARK-48138 > URL: https://issues.apache.org/jira/browse/SPARK-48138 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > - https://github.com/apache/spark/actions/runs/8962353911/job/24611130573 > (Master, 5/5) > - https://github.com/apache/spark/actions/runs/8948176536/job/24581022674 > (Master, 5/4) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48138) Disable a flaky `SparkSessionE2ESuite.interrupt tag` test
[ https://issues.apache.org/jira/browse/SPARK-48138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-48138: Assignee: Dongjoon Hyun > Disable a flaky `SparkSessionE2ESuite.interrupt tag` test > - > > Key: SPARK-48138 > URL: https://issues.apache.org/jira/browse/SPARK-48138 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > - https://github.com/apache/spark/actions/runs/8962353911/job/24611130573 > (Master, 5/5) > - https://github.com/apache/spark/actions/runs/8948176536/job/24581022674 > (Master, 5/4) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35531) Can not insert into hive bucket table if create table with upper case schema
[ https://issues.apache.org/jira/browse/SPARK-35531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843620#comment-17843620 ] Sandeep Katta commented on SPARK-35531: --- Bug is tracked here https://issues.apache.org/jira/browse/SPARK-48140 > Can not insert into hive bucket table if create table with upper case schema > > > Key: SPARK-35531 > URL: https://issues.apache.org/jira/browse/SPARK-35531 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.1, 3.2.0 >Reporter: Hongyi Zhang >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0, 3.1.4 > > > > > create table TEST1( > V1 BIGINT, > S1 INT) > partitioned by (PK BIGINT) > clustered by (V1) > sorted by (S1) > into 200 buckets > STORED AS PARQUET; > > insert into test1 > select > * from values(1,1,1); > > > org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not > part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), > FieldSchema(name:s1, type:int, comment:null)] > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not > part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), > FieldSchema(name:s1, type:int, comment:null)] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48140) Can not alter bucketed table if create table with upper case schema
[ https://issues.apache.org/jira/browse/SPARK-48140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Katta updated SPARK-48140: -- Description: Running below SQL command throws exception CREATE TABLE TEST1( V1 BIGINT, S1 INT) PARTITIONED BY (PK BIGINT) CLUSTERED BY (V1) SORTED BY (S1) INTO 200 BUCKETS STORED AS PARQUET; ALTER TABLE test1 SET TBLPROPERTIES ('comment' = 'This is a new comment.'); *Exception:* {code:java} Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), FieldSchema(name:s1, type:int, comment:null)] at org.apache.hadoop.hive.ql.metadata.Table.setBucketCols(Table.java:552) at org.apache.spark.sql.hive.client.HiveClientImpl$.toHiveTable(HiveClientImpl.scala:1145) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$alterTable$1(HiveClientImpl.scala:594) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283) at org.apache.spark.sql.hive.client.HiveClientImpl.alterTable(HiveClientImpl.scala:587) at org.apache.spark.sql.hive.client.HiveClient.alterTable(HiveClient.scala:124) at org.apache.spark.sql.hive.client.HiveClient.alterTable$(HiveClient.scala:123) at org.apache.spark.sql.hive.client.HiveClientImpl.alterTable(HiveClientImpl.scala:93) at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$alterTable$1(HiveExternalCatalog.scala:687) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) ... 
62 more {code} was: Running below SQL command throws exception CREATE TABLE TEST1( V1 BIGINT, S1 INT) PARTITIONED BY (PK BIGINT) CLUSTERED BY (V1) SORTED BY (S1) INTO 200 BUCKETS STORED AS PARQUET; ALTER TABLE test1 SET TBLPROPERTIES ('comment' = 'This is a new comment.'); > Can not alter bucketed table if create table with upper case schema > --- > > Key: SPARK-48140 > URL: https://issues.apache.org/jira/browse/SPARK-48140 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Sandeep Katta >Priority: Major > > Running below SQL command throws exception > > CREATE TABLE TEST1( > V1 BIGINT, > S1 INT) > PARTITIONED BY (PK BIGINT) > CLUSTERED BY (V1) > SORTED BY (S1) > INTO 200 BUCKETS > STORED AS PARQUET; > ALTER TABLE test1 SET TBLPROPERTIES ('comment' = 'This is a new comment.'); > *Exception:* > {code:java} > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns > V1 is not part of the table columns ([FieldSchema(name:v1, type:bigint, > comment:null), FieldSchema(name:s1, type:int, comment:null)] > at > org.apache.hadoop.hive.ql.metadata.Table.setBucketCols(Table.java:552) > at > org.apache.spark.sql.hive.client.HiveClientImpl$.toHiveTable(HiveClientImpl.scala:1145) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$alterTable$1(HiveClientImpl.scala:594) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283) > at > org.apache.spark.sql.hive.client.HiveClientImpl.alterTable(HiveClientImpl.scala:587) > at > org.apache.spark.sql.hive.client.HiveClient.alterTable(HiveClient.scala:124) > at > 
org.apache.spark.sql.hive.client.HiveClient.alterTable$(HiveClient.scala:123) > at > org.apache.spark.sql.hive.client.HiveClientImpl.alterTable(HiveClientImpl.scala:93) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$alterTable$1(HiveExternalCatalog.scala:687) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) > ... 62 more > {code} -- This message was sent by Atlassian Jira (v8.
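The stack trace above points at Hive's `Table.setBucketCols`, which validates bucket columns against the table columns case-sensitively. A minimal Java sketch of that mismatch follows; `isValidBucketCol` is a hypothetical stand-in for Hive's check, not actual Hive or Spark code. The assumption it illustrates: Spark stores table column names lowercased in the metastore, while the bucket spec keeps the original upper case, so a case-sensitive membership test fails.

```java
import java.util.Arrays;
import java.util.List;

public class BucketColCheck {
    // Illustrative stand-in for Hive's Table.setBucketCols validation:
    // each bucket column must appear, case-sensitively, among the table columns.
    public static boolean isValidBucketCol(String bucketCol, List<String> tableCols) {
        return tableCols.contains(bucketCol);
    }

    public static void main(String[] args) {
        // Column names as stored in the Hive metastore (lowercased by Spark).
        List<String> tableCols = Arrays.asList("v1", "s1");
        // Bucket column as written in the CREATE TABLE statement (original case).
        String bucketCol = "V1";

        System.out.println(isValidBucketCol(bucketCol, tableCols));               // false -> HiveException path
        System.out.println(isValidBucketCol(bucketCol.toLowerCase(), tableCols)); // true  -> lowercasing would pass
    }
}
```

This is why the same table round-trips fine when created with lowercase column names: the case-sensitive check then sees identical strings.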
[jira] [Created] (SPARK-48140) Can not alter bucketed table if create table with upper case schema
Sandeep Katta created SPARK-48140: - Summary: Can not alter bucketed table if create table with upper case schema Key: SPARK-48140 URL: https://issues.apache.org/jira/browse/SPARK-48140 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Sandeep Katta Running the SQL commands below throws an exception: CREATE TABLE TEST1( V1 BIGINT, S1 INT) PARTITIONED BY (PK BIGINT) CLUSTERED BY (V1) SORTED BY (S1) INTO 200 BUCKETS STORED AS PARQUET; ALTER TABLE test1 SET TBLPROPERTIES ('comment' = 'This is a new comment.');
[jira] [Commented] (SPARK-35531) Can not insert into hive bucket table if create table with upper case schema
[ https://issues.apache.org/jira/browse/SPARK-35531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843619#comment-17843619 ] Sandeep Katta commented on SPARK-35531: --- [~angerszhuuu] , I do see same issue in alter table command, I tested in SPARK-3.5.0 and issue still exists {code:java} CREATE TABLE TEST1( V1 BIGINT, S1 INT) PARTITIONED BY (PK BIGINT) CLUSTERED BY (V1) SORTED BY (S1) INTO 200 BUCKETS STORED AS PARQUET; ALTER TABLE test1 SET TBLPROPERTIES ('comment' = 'This is a new comment.'); {code} {code:java} Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), FieldSchema(name:s1, type:int, comment:null)] at org.apache.hadoop.hive.ql.metadata.Table.setBucketCols(Table.java:552) at org.apache.spark.sql.hive.client.HiveClientImpl$.toHiveTable(HiveClientImpl.scala:1145) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$alterTable$1(HiveClientImpl.scala:594) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283) at org.apache.spark.sql.hive.client.HiveClientImpl.alterTable(HiveClientImpl.scala:587) at org.apache.spark.sql.hive.client.HiveClient.alterTable(HiveClient.scala:124) at org.apache.spark.sql.hive.client.HiveClient.alterTable$(HiveClient.scala:123) at org.apache.spark.sql.hive.client.HiveClientImpl.alterTable(HiveClientImpl.scala:93) at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$alterTable$1(HiveExternalCatalog.scala:687) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) ... 62 more {code} > Can not insert into hive bucket table if create table with upper case schema > > > Key: SPARK-35531 > URL: https://issues.apache.org/jira/browse/SPARK-35531 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.1, 3.2.0 >Reporter: Hongyi Zhang >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0, 3.1.4 > > > > > create table TEST1( > V1 BIGINT, > S1 INT) > partitioned by (PK BIGINT) > clustered by (V1) > sorted by (S1) > into 200 buckets > STORED AS PARQUET; > > insert into test1 > select > * from values(1,1,1); > > > org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not > part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), > FieldSchema(name:s1, type:int, comment:null)] > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not > part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), > FieldSchema(name:s1, type:int, comment:null)] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48138) Disable a flaky `SparkSessionE2ESuite.interrupt tag` test
[ https://issues.apache.org/jira/browse/SPARK-48138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48138: -- Description: - https://github.com/apache/spark/actions/runs/8962353911/job/24611130573 (Master, 5/5) - https://github.com/apache/spark/actions/runs/8948176536/job/24581022674 (Master, 5/4) > Disable a flaky `SparkSessionE2ESuite.interrupt tag` test > - > > Key: SPARK-48138 > URL: https://issues.apache.org/jira/browse/SPARK-48138 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > - https://github.com/apache/spark/actions/runs/8962353911/job/24611130573 > (Master, 5/5) > - https://github.com/apache/spark/actions/runs/8948176536/job/24581022674 > (Master, 5/4) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48139) Re-enable `SparkSessionE2ESuite.interrupt tag`
[ https://issues.apache.org/jira/browse/SPARK-48139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48139: -- Description: (was: - https://github.com/apache/spark/actions/runs/8962353911/job/24611130573 (Master, 5/5) - https://github.com/apache/spark/actions/runs/8948176536/job/24581022674 (Master, 5/4)) > Re-enable `SparkSessionE2ESuite.interrupt tag` > -- > > Key: SPARK-48139 > URL: https://issues.apache.org/jira/browse/SPARK-48139 > Project: Spark > Issue Type: Bug > Components: Connect, Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker >
[jira] [Updated] (SPARK-48139) Re-enable `SparkSessionE2ESuite.interrupt tag`
[ https://issues.apache.org/jira/browse/SPARK-48139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48139: -- Description: - https://github.com/apache/spark/actions/runs/8962353911/job/24611130573 (Master, 5/5) - https://github.com/apache/spark/actions/runs/8948176536/job/24581022674 (Master, 5/4) > Re-enable `SparkSessionE2ESuite.interrupt tag` > -- > > Key: SPARK-48139 > URL: https://issues.apache.org/jira/browse/SPARK-48139 > Project: Spark > Issue Type: Bug > Components: Connect, Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > > - https://github.com/apache/spark/actions/runs/8962353911/job/24611130573 > (Master, 5/5) > - https://github.com/apache/spark/actions/runs/8948176536/job/24581022674 > (Master, 5/4)
[jira] [Updated] (SPARK-48138) Disable a flaky `SparkSessionE2ESuite.interrupt tag` test
[ https://issues.apache.org/jira/browse/SPARK-48138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48138: --- Labels: pull-request-available (was: ) > Disable a flaky `SparkSessionE2ESuite.interrupt tag` test > - > > Key: SPARK-48138 > URL: https://issues.apache.org/jira/browse/SPARK-48138 > Project: Spark > Issue Type: Sub-task > Components: Connect, Tests >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available >
[jira] [Created] (SPARK-48138) Disable a flaky `SparkSessionE2ESuite.interrupt tag` test
Dongjoon Hyun created SPARK-48138: - Summary: Disable a flaky `SparkSessionE2ESuite.interrupt tag` test Key: SPARK-48138 URL: https://issues.apache.org/jira/browse/SPARK-48138 Project: Spark Issue Type: Sub-task Components: Connect, Tests Affects Versions: 4.0.0 Reporter: Dongjoon Hyun
[jira] [Resolved] (SPARK-48136) Always upload Spark Connect log files
[ https://issues.apache.org/jira/browse/SPARK-48136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-48136. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46393 [https://github.com/apache/spark/pull/46393] > Always upload Spark Connect log files > - > > Key: SPARK-48136 > URL: https://issues.apache.org/jira/browse/SPARK-48136 > Project: Spark > Issue Type: Improvement > Components: Connect, Project Infra, PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > We should always upload log files when the run is not successful
[jira] [Assigned] (SPARK-47777) Add spark connect test for python streaming data source
[ https://issues.apache.org/jira/browse/SPARK-47777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-47777: Assignee: (was: Chaoqin Li) > Add spark connect test for python streaming data source > --- > > Key: SPARK-47777 > URL: https://issues.apache.org/jira/browse/SPARK-47777 > Project: Spark > Issue Type: Test > Components: PySpark, SS, Tests >Affects Versions: 3.5.1 >Reporter: Chaoqin Li >Priority: Major > Labels: pull-request-available > > Make the Python streaming data source PySpark test also run on Spark Connect.
[jira] [Reopened] (SPARK-47777) Add spark connect test for python streaming data source
[ https://issues.apache.org/jira/browse/SPARK-47777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-47777: -- Reverted at https://github.com/apache/spark/commit/4e69857195a6f95c22f962e3eed950876036c04f > Add spark connect test for python streaming data source > --- > > Key: SPARK-47777 > URL: https://issues.apache.org/jira/browse/SPARK-47777 > Project: Spark > Issue Type: Test > Components: PySpark, SS, Tests >Affects Versions: 3.5.1 >Reporter: Chaoqin Li >Assignee: Chaoqin Li >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Make the Python streaming data source PySpark test also run on Spark Connect.
[jira] [Updated] (SPARK-47777) Add spark connect test for python streaming data source
[ https://issues.apache.org/jira/browse/SPARK-47777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-47777: - Fix Version/s: (was: 4.0.0) > Add spark connect test for python streaming data source > --- > > Key: SPARK-47777 > URL: https://issues.apache.org/jira/browse/SPARK-47777 > Project: Spark > Issue Type: Test > Components: PySpark, SS, Tests >Affects Versions: 3.5.1 >Reporter: Chaoqin Li >Assignee: Chaoqin Li >Priority: Major > Labels: pull-request-available > > Make the Python streaming data source PySpark test also run on Spark Connect.
[jira] [Updated] (SPARK-48135) Run `buf` and `ui` only in PR builders and Java 21 Daily CI
[ https://issues.apache.org/jira/browse/SPARK-48135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-48135: -- Summary: Run `buf` and `ui` only in PR builders and Java 21 Daily CI (was: Run `but` and `ui` only in PR builders and Java 21 Daily CI) > Run `buf` and `ui` only in PR builders and Java 21 Daily CI > --- > > Key: SPARK-48135 > URL: https://issues.apache.org/jira/browse/SPARK-48135 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available >
[jira] [Commented] (SPARK-48073) StateStore schema incompatibility between 3.2 and 3.4
[ https://issues.apache.org/jira/browse/SPARK-48073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843583#comment-17843583 ] L. C. Hsieh commented on SPARK-48073: - The breaking change was introduced by https://github.com/apache/spark/pull/39615 > StateStore schema incompatibility between 3.2 and 3.4 > - > > Key: SPARK-48073 > URL: https://issues.apache.org/jira/browse/SPARK-48073 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.4 >Reporter: L. C. Hsieh >Priority: Major > > One of our customers encountered schema incompatibility problems when > upgrading a Structured Streaming application from Spark 3.2 to 3.4. > It seems that in 3.4 `Encoders.bean()` includes properties that have a getter, whether > or not they also have a setter, whereas in 3.2 only properties with both a getter and a > setter are included. > For example, here are the schemas for an AtomicLong property/field generated by > each version: > 3.2: > StructType(StructField(opaque,LongType,true),StructField(plain,LongType,true)) > 3.4: > StructType(StructField(acquire,LongType,false),StructField(andDecrement,LongType,false),StructField(andIncrement,LongType,false),StructField(opaque,LongType,false),StructField(plain,LongType,false)) > Note that the nullability flag also changes: > the primitive long schema has nullable=true in 3.2 but false in 3.4. > I am not sure whether the community is already aware of this issue, and whether there is > a workaround for it. > Thanks.
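The reported getter/setter difference can be probed with plain `java.beans.Introspector`. The sketch below is an illustration of the behavior described in the comment, not Spark's actual `Encoders.bean()` code: selecting properties with both getter and setter models what the report attributes to 3.2, while selecting any readable property models 3.4.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BeanPropertyProbe {
    // A bean with one read-write property and one getter-only property.
    public static class Sample {
        private long plain;
        public long getPlain() { return plain; }
        public void setPlain(long plain) { this.plain = plain; }
        public long getDerived() { return plain + 1; } // getter only, no setter
    }

    // Properties with both getter and setter (the report's description of 3.2).
    public static List<String> readWriteProps(Class<?> cls) {
        return props(cls, true);
    }

    // Properties with at least a getter (the report's description of 3.4).
    public static List<String> readableProps(Class<?> cls) {
        return props(cls, false);
    }

    private static List<String> props(Class<?> cls, boolean requireSetter) {
        try {
            List<String> names = new ArrayList<>();
            for (PropertyDescriptor pd : Introspector.getBeanInfo(cls).getPropertyDescriptors()) {
                if ("class".equals(pd.getName())) continue; // skip Object.getClass()
                if (pd.getReadMethod() == null) continue;
                if (requireSetter && pd.getWriteMethod() == null) continue;
                names.add(pd.getName());
            }
            Collections.sort(names);
            return names;
        } catch (IntrospectionException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readWriteProps(Sample.class)); // [plain]
        System.out.println(readableProps(Sample.class));  // [derived, plain]
    }
}
```

Run against a class like AtomicLong, the second selection would pick up the extra getter-style accessors (acquire, opaque, plain, ...) that show up in the 3.4 schema above.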
[jira] [Comment Edited] (SPARK-47353) Mode (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843574#comment-17843574 ] Gideon P edited comment on SPARK-47353 at 5/5/24 6:30 PM: -- [~uros-db] Mode uses an accumulating OpenHashMap to determine the count of each unique element. Currently, the Apache Spark Mode function uses OpenHashMap to track occurrences of each key. However, with collation ordering (where multiple keys might compare as equal), using a direct hash map will not work effectively since different keys will need to be treated as the same. A few approaches to handle collations come to mind 1. Modify implementation `Mode.eval` to combine the map further. Perhaps by turning the map into a list of key-value tuples and folding. If the last element of the accumulating list and the current element being folded are equal according to collation, combine their counts 2. Another way to modify implementation `Mode.eval` to combine the map further would be to add all the elements of the buffer to a TreeMap with Comparator. A TreeMap can efficiently keep track of values and their counts in a sorted manner using a collation-sensitive comparator. 3. Use a TreeMap instead of OpenHashMap during the accumulation stage. Create a trait similar to TypedAggregateWithHashMapAsBuffer. Switch to use of this whenever both datatype of column is StringType and we are using a session collation. Would implement TypedImperativeAggregate. 4. Potentially using codegen fallback in this case would work. To start, I will try approach number 2. Please let me know if I am on the right track and if you have any ideas! was (Author: JIRAUSER304403): [~uros-db] Mode uses an accumulating OpenHashMap to determine the count of each unique element. Currently, the Apache Spark Mode function uses OpenHashMap to track occurrences of each key. 
However, with collation ordering (where multiple keys might compare as equal), using a direct hash map will not work effectively since different keys will need to be treated as the same. A few approaches to handle collations come to mind 1. Modify implementation `Mode.eval` to combine the map further. Perhaps by turning the map into a list of key-value tuples and folding. If the last element of the accumulating list and the current element being folded are equal according to collation, combine their counts 2. Another way to modify implementation `Mode.eval` to combine the map further would be to add all the elements of the buffer to a TreeMap with Comparator. A TreeMap can efficiently keep track of values and their counts in a sorted manner using a collation-sensitive comparator. 3. Use a TreeMap instead of OpenHashMap during the accumulation stage. Create a trait similar to TypedAggregateWithHashMapAsBuffer. Switch to use of this whenever both datatype of column is StringType and we are using a session collation. Would implement TypedImperativeAggregate. To start, I will try approach number 2. Please let me know if I am on the right track and if you have any ideas! > Mode (all collations) > - > > Key: SPARK-47353 > URL: https://issues.apache.org/jira/browse/SPARK-47353 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *Mode* expression in Spark. First confirm > what is the expected behaviour for this expression when given collated > strings, then move on to the implementation that would enable handling > strings of all collation types. Implement the corresponding unit tests and > E2E SQL tests to reflect how this function should be used with collation in > SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment > with the existing functions to learn more about how they work. 
In addition, > look into the possible use-cases and implementation of similar functions > within other open-source DBMS, such as > [PostgreSQL|https://www.postgresql.org/docs/]. > > The goal for this Jira ticket is to implement the *Mode* expression so it > supports all collation types currently supported in Spark. To understand what > changes were introduced in order to enable full collation support for other > existing functions in Spark, take a look at the Spark PRs and Jira tickets > for completed tasks in this parent (for example: Contains, StartsWith, > EndsWith). > Examples: > With UTF8_BINARY collation, the query > SELECT mode(col) FROM VALUES (‘a’), (‘a’), (‘a’), (‘B’), (‘B’), (‘b’), (‘b’) > AS tab(col); > should return 'a'. > With UTF8_BINARY_LCASE collation, the query > SELECT mode(col) FROM VALUES (‘a’), (‘a’), (‘a’), (‘B’), (‘B’), (‘b’), (‘b’) > AS tab(col); > should return either 'B' or 'b'. > > R
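Approach 2 from the comment above (fold the count buffer into a TreeMap with a collation-aware Comparator) can be sketched in plain Java. As an assumption for illustration, `java.text.Collator` at SECONDARY strength stands in for a case-insensitive collation such as UTF8_BINARY_LCASE; Spark would use its own collation comparators, and the buffer here is a plain HashMap rather than Spark's OpenHashMap.

```java
import java.text.Collator;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class CollatedMode {
    // Fold per-key counts into a TreeMap whose comparator treats
    // collation-equal keys as the same key, merging their counts,
    // then return the entry with the highest merged count.
    public static Map.Entry<String, Long> mode(Map<String, Long> buffer, Collator collator) {
        TreeMap<String, Long> merged = new TreeMap<>(collator);
        for (Map.Entry<String, Long> e : buffer.entrySet()) {
            merged.merge(e.getKey(), e.getValue(), Long::sum);
        }
        Map.Entry<String, Long> best = null;
        for (Map.Entry<String, Long> e : merged.entrySet()) {
            if (best == null || e.getValue() > best.getValue()) best = e;
        }
        return best;
    }

    public static void main(String[] args) {
        // Counts as the ticket's example would accumulate them: a=3, B=2, b=2.
        Map<String, Long> buffer = new HashMap<>();
        buffer.put("a", 3L);
        buffer.put("B", 2L);
        buffer.put("b", 2L);

        // SECONDARY strength ignores case differences, so "B" and "b" compare equal.
        Collator ci = Collator.getInstance(Locale.ROOT);
        ci.setStrength(Collator.SECONDARY);

        Map.Entry<String, Long> m = mode(buffer, ci);
        // "B" and "b" merge to a count of 4, beating "a" (3); the surviving key
        // is whichever of "B"/"b" the TreeMap saw first, matching the ticket's
        // "either 'B' or 'b'" expectation.
        System.out.println(m.getKey() + " -> " + m.getValue());
    }
}
```

The merge cost is O(n log n) over the n distinct buffer keys, which only applies after the hash-based accumulation, so the hot aggregation path keeps its existing shape.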
[jira] [Commented] (SPARK-48045) Pandas API groupby with multi-agg-relabel ignores as_index=False
[ https://issues.apache.org/jira/browse/SPARK-48045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843576#comment-17843576 ] Saidatt Sinai Amonkar commented on SPARK-48045: --- Opened a pull request to fix this: [GitHub Pull Request #46391|https://github.com/apache/spark/pull/46391] > Pandas API groupby with multi-agg-relabel ignores as_index=False > > > Key: SPARK-48045 > URL: https://issues.apache.org/jira/browse/SPARK-48045 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.5.1 > Environment: Python 3.11, PySpark 3.5.1, Pandas=2.2.2 >Reporter: Paul George >Priority: Minor > Labels: pull-request-available > > A Pandas API DataFrame groupby with as_index=False and a multilevel > relabeling, such as > {code:java} > from pyspark import pandas as ps > ps.DataFrame({"a": [0, 0], "b": [0, 1]}).groupby("a", > as_index=False).agg(b_max=("b", "max")){code} > fails to include group keys in the resulting DataFrame. This diverges from > expected behavior as well as from the behavior of native Pandas, e.g. > *actual* > {code:java} > b_max > 0 1 {code} > *expected* > {code:java} > a b_max > 0 0 1 {code} > > A possible fix is to prepend groupby key columns to {{*order*}} and > {{*columns*}} before filtering here: > [https://github.com/apache/spark/blob/master/python/pyspark/pandas/groupby.py#L327-L328] > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48045) Pandas API groupby with multi-agg-relabel ignores as_index=False
[ https://issues.apache.org/jira/browse/SPARK-48045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48045: --- Labels: pull-request-available (was: ) > Pandas API groupby with multi-agg-relabel ignores as_index=False > > > Key: SPARK-48045 > URL: https://issues.apache.org/jira/browse/SPARK-48045 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark >Affects Versions: 3.5.1 > Environment: Python 3.11, PySpark 3.5.1, Pandas=2.2.2 >Reporter: Paul George >Priority: Minor > Labels: pull-request-available > > A Pandas API DataFrame groupby with as_index=False and a multilevel > relabeling, such as > {code:java} > from pyspark import pandas as ps > ps.DataFrame({"a": [0, 0], "b": [0, 1]}).groupby("a", > as_index=False).agg(b_max=("b", "max")){code} > fails to include group keys in the resulting DataFrame. This diverges from > expected behavior as well as from the behavior of native Pandas, e.g. > *actual* > {code:java} > b_max > 0 1 {code} > *expected* > {code:java} > a b_max > 0 0 1 {code} > > A possible fix is to prepend groupby key columns to {{*order*}} and > {{*columns*}} before filtering here: > [https://github.com/apache/spark/blob/master/python/pyspark/pandas/groupby.py#L327-L328] > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47353) Mode (all collations)
[ https://issues.apache.org/jira/browse/SPARK-47353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17843574#comment-17843574 ] Gideon P commented on SPARK-47353: -- [~uros-db] Mode uses an accumulating OpenHashMap to determine the count of each unique element. Currently, the Apache Spark Mode function uses OpenHashMap to track occurrences of each key. However, with collation ordering (where multiple keys might compare as equal), using a direct hash map will not work effectively since different keys will need to be treated as the same. A few approaches to handle collations come to mind 1. Modify implementation `Mode.eval` to combine the map further. Perhaps by turning the map into a list of key-value tuples and folding. If the last element of the accumulating list and the current element being folded are equal according to collation, combine their counts 2. Another way to modify implementation `Mode.eval` to combine the map further would be to add all the elements of the buffer to a TreeMap with Comparator. A TreeMap can efficiently keep track of values and their counts in a sorted manner using a collation-sensitive comparator. 3. Use a TreeMap instead of OpenHashMap during the accumulation stage. Create a trait similar to TypedAggregateWithHashMapAsBuffer. Switch to use of this whenever both datatype of column is StringType and we are using a session collation. Would implement TypedImperativeAggregate. To start, I will try approach number 2. Please let me know if I am on the right track and if you have any ideas! > Mode (all collations) > - > > Key: SPARK-47353 > URL: https://issues.apache.org/jira/browse/SPARK-47353 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Priority: Major > > Enable collation support for the *Mode* expression in Spark. 
First confirm
> what is the expected behaviour for this expression when given collated
> strings, then move on to the implementation that would enable handling
> strings of all collation types. Implement the corresponding unit tests and
> E2E SQL tests to reflect how this function should be used with collation in
> Spark SQL, and feel free to use your chosen Spark SQL editor to experiment
> with the existing functions to learn more about how they work. In addition,
> look into the possible use cases and implementation of similar functions
> within other open-source DBMSs, such as
> [PostgreSQL|https://www.postgresql.org/docs/].
>
> The goal for this Jira ticket is to implement the *Mode* expression so it
> supports all collation types currently supported in Spark. To understand what
> changes were introduced in order to enable full collation support for other
> existing functions in Spark, take a look at the Spark PRs and Jira tickets
> for completed tasks in this parent (for example: Contains, StartsWith,
> EndsWith).
> Examples:
> With UTF8_BINARY collation, the query
> SELECT mode(col) FROM VALUES ('a'), ('a'), ('a'), ('B'), ('B'), ('b'), ('b')
> AS tab(col);
> should return 'a'.
> With UTF8_BINARY_LCASE collation, the query
> SELECT mode(col) FROM VALUES ('a'), ('a'), ('a'), ('B'), ('B'), ('b'), ('b')
> AS tab(col);
> should return either 'B' or 'b'.
>
> Read more about ICU [Collation Concepts|http://example.com/] and
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical
> Standard for
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
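Approach 2 from the comment above (merge the accumulated hash-map counts under a collation-aware ordering, then pick the largest group) can be sketched in plain Python. This is a hypothetical model only: lowercasing stands in for a UTF8_BINARY_LCASE comparator, not a real ICU collator, and `counts` models the OpenHashMap buffer that Mode accumulates.

```python
def collation_mode(counts, collation_key=str.lower):
    """Merge counts whose keys compare equal under a collation, then
    return a representative of the most frequent merged group.

    counts         -- dict modeling Mode's OpenHashMap buffer
    collation_key  -- stand-in for a collation-sensitive comparator
                      (lowercasing mimics UTF8_BINARY_LCASE)
    """
    merged = {}  # collation key -> (representative surface form, merged count)
    for value, count in counts.items():
        ck = collation_key(value)
        if ck in merged:
            rep, total = merged[ck]
            merged[ck] = (rep, total + count)
        else:
            merged[ck] = (value, count)
    # The group with the highest merged count wins; any surface form in
    # that group is an acceptable mode under this collation.
    rep, _ = max(merged.values(), key=lambda pair: pair[1])
    return rep

# Mirrors the ticket's example: 'B'/'b' merge to a count of 4, beating 'a' (3).
print(collation_mode({"a": 3, "B": 2, "b": 2}))  # prints B
```

Under binary collation (`collation_key=lambda s: s`) nothing merges, so the same input would return 'a', matching the UTF8_BINARY example in the ticket.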
[jira] [Updated] (SPARK-48134) Spark core (java side): Migrate `error/warn/info` with variables to structured logging framework
[ https://issues.apache.org/jira/browse/SPARK-48134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-48134: --- Labels: pull-request-available (was: )
> Spark core (java side): Migrate `error/warn/info` with variables to
> structured logging framework
>
> Key: SPARK-48134
> URL: https://issues.apache.org/jira/browse/SPARK-48134
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: BingKun Pan
> Priority: Critical
> Labels: pull-request-available
[jira] [Created] (SPARK-48134) Spark core (java side): Migrate `error/warn/info` with variables to structured logging framework
BingKun Pan created SPARK-48134: ---
Summary: Spark core (java side): Migrate `error/warn/info` with variables to structured logging framework
Key: SPARK-48134
URL: https://issues.apache.org/jira/browse/SPARK-48134
Project: Spark
Issue Type: Sub-task
Components: Spark Core
Affects Versions: 4.0.0
Reporter: BingKun Pan