[jira] [Updated] (SPARK-47633) Cache miss for queries using JOIN LATERAL with join condition
[ https://issues.apache.org/jira/browse/SPARK-47633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Robbins updated SPARK-47633:
----------------------------------
    Affects Version/s: 3.4.2

> Cache miss for queries using JOIN LATERAL with join condition
> -------------------------------------------------------------
>
>                 Key: SPARK-47633
>                 URL: https://issues.apache.org/jira/browse/SPARK-47633
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.2, 4.0.0, 3.5.1
>            Reporter: Bruce Robbins
>            Priority: Major
>
> For example:
> {noformat}
> CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2);
> CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2);
>
> create or replace temp view v1 as
> select *
> from t1
> join lateral (
>   select c1 as a, c2 as b
>   from t2)
> on c1 = a;
>
> cache table v1;
>
> explain select * from v1;
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- BroadcastHashJoin [c1#180], [a#173], Inner, BuildRight, false
>    :- LocalTableScan [c1#180, c2#181]
>    +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [plan_id=113]
>       +- LocalTableScan [a#173, b#174]
> {noformat}
> Note that there is no {{InMemoryRelation}}.
>
> However, if you move the join condition into the subquery, the cached plan is used:
> {noformat}
> CREATE or REPLACE TEMP VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2);
> CREATE or REPLACE TEMP VIEW t2(c1, c2) AS VALUES (0, 1), (1, 2);
>
> create or replace temp view v2 as
> select *
> from t1
> join lateral (
>   select c1 as a, c2 as b
>   from t2
>   where t1.c1 = t2.c1);
>
> cache table v2;
>
> explain select * from v2;
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Scan In-memory table v2 [c1#176, c2#177, a#178, b#179]
>    +- InMemoryRelation [c1#176, c2#177, a#178, b#179], StorageLevel(disk, memory, deserialized, 1 replicas)
>       +- AdaptiveSparkPlan isFinalPlan=true
>          +- == Final Plan ==
>             *(1) Project [c1#26, c2#27, a#19, b#20]
>             +- *(1) BroadcastHashJoin [c1#26], [c1#30], Inner, BuildLeft, false
>                :- BroadcastQueryStage 0
>                :  +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [plan_id=37]
>                :     +- LocalTableScan [c1#26, c2#27]
>                +- *(1) LocalTableScan [a#19, b#20, c1#30]
>          +- == Initial Plan ==
>             Project [c1#26, c2#27, a#19, b#20]
>             +- BroadcastHashJoin [c1#26], [c1#30], Inner, BuildLeft, false
>                :- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), [plan_id=37]
>                :  +- LocalTableScan [c1#26, c2#27]
>                +- LocalTableScan [a#19, b#20, c1#30]
> {noformat}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
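As background for why the two plans above behave differently: Spark decides cache hits by comparing canonicalized query plans, in which cosmetic details such as expression ID numbers are normalized away, while structural differences are not. The sketch below is a toy model in plain Python, not Spark's actual CacheManager code; the plan shapes and the `canonicalize` helper are invented for illustration only.

```python
# Toy model (an assumption, not Spark source): plans are nested tuples whose
# leaves may carry expression IDs like "c1#180". Canonicalization renumbers
# IDs in first-visit order, so two plans that differ only in IDs compare
# equal, while plans with different *structure* still compare unequal.

def canonicalize(plan, mapping=None):
    """Renumber expression IDs ('name#id') in first-visit order."""
    if mapping is None:
        mapping = {}
    if isinstance(plan, str) and "#" in plan:
        name, expr_id = plan.split("#")
        norm = mapping.setdefault(expr_id, len(mapping))
        return f"{name}#{norm}"
    if isinstance(plan, tuple):
        return tuple(canonicalize(child, mapping) for child in plan)
    return plan

# Same shape, different expression IDs -> same canonical form (cache hit).
p1 = ("Project", ("c1#180", "c2#181"), ("Scan", "t1"))
p2 = ("Project", ("c1#26", "c2#27"), ("Scan", "t1"))
assert canonicalize(p1) == canonicalize(p2)

# Condition kept in the join vs. pushed into the subquery -> different
# shapes, so the canonical forms differ and the lookup misses.
q1 = ("Join", ("=", "c1#1", "a#2"), ("Scan", "t1"),
      ("Subquery", ("Scan", "t2")))
q2 = ("Join", None, ("Scan", "t1"),
      ("Subquery", ("Filter", ("=", "c1#1", "c1#3"), ("Scan", "t2"))))
assert canonicalize(q1) != canonicalize(q2)
```

This also suggests why moving the predicate into the subquery "fixes" the miss: both the CACHE TABLE statement and the later query then analyze to the same shape.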
[jira] [Updated] (SPARK-47633) Cache miss for queries using JOIN LATERAL with join condition
[ https://issues.apache.org/jira/browse/SPARK-47633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Robbins updated SPARK-47633:
----------------------------------
    Affects Version/s: 3.5.1
[jira] [Created] (SPARK-47633) Cache miss for queries using JOIN LATERAL with join condition
Bruce Robbins created SPARK-47633:
----------------------------------
             Summary: Cache miss for queries using JOIN LATERAL with join condition
                 Key: SPARK-47633
                 URL: https://issues.apache.org/jira/browse/SPARK-47633
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Bruce Robbins
[jira] [Resolved] (SPARK-47527) Cache miss for queries using With expressions
[ https://issues.apache.org/jira/browse/SPARK-47527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Robbins resolved SPARK-47527.
-----------------------------------
    Resolution: Duplicate

> Cache miss for queries using With expressions
> ---------------------------------------------
>
>                 Key: SPARK-47527
>                 URL: https://issues.apache.org/jira/browse/SPARK-47527
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Bruce Robbins
>            Priority: Major
>              Labels: pull-request-available
>
> For example:
> {noformat}
> create or replace temp view v1 as
> select id from range(10);
>
> create or replace temp view q1 as
> select * from v1
> where id between 2 and 4;
>
> cache table q1;
>
> explain select * from q1;
> == Physical Plan ==
> *(1) Filter ((id#51L >= 2) AND (id#51L <= 4))
> +- *(1) Range (0, 10, step=1, splits=8)
> {noformat}
> Similarly:
> {noformat}
> create or replace temp view q2 as
> select count_if(id > 3) as cnt
> from v1;
>
> cache table q2;
>
> explain select * from q2;
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[], functions=[count(if (NOT _common_expr_0#88) null else _common_expr_0#88)])
>    +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=182]
>       +- HashAggregate(keys=[], functions=[partial_count(if (NOT _common_expr_0#88) null else _common_expr_0#88)])
>          +- Project [(id#86L > 3) AS _common_expr_0#88]
>             +- Range (0, 10, step=1, splits=8)
> {noformat}
> Neither of the above explain outputs includes an {{InMemoryRelation}} node.
>
> The culprit seems to be the common expression ids in the {{With}} expressions used in runtime replacements for {{between}} and {{count_if}}, e.g. [this code|https://github.com/apache/spark/blob/39500a315166d8e342b678ef3038995a03ce84d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Between.scala#L43].
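The description above blames the common expression ids allocated for {{With}} rewrites. A minimal sketch of that failure mode in plain Python rather than Catalyst (the `rewrite_between` helper and the plan shapes are invented for illustration): each analysis of the same query mints a fresh `_common_expr_0#<id>`, so if plan comparison does not normalize that id, the query analyzed at lookup time never equals the plan analyzed when caching.

```python
import itertools

# Illustrative model (assumption, not Spark source): every rewrite of
# `between` wraps the condition in a With expression holding a freshly
# numbered common expression, e.g. "_common_expr_0#88". Without id
# normalization, two analyses of the *same* query compare unequal.

_counter = itertools.count(88)

def rewrite_between(col, lo, hi):
    common = f"_common_expr_0#{next(_counter)}"  # fresh id every call
    return ("With", common, ("And", (">=", common, lo), ("<=", common, hi)))

plan_a = rewrite_between("id", 2, 4)  # analyzed when caching q1
plan_b = rewrite_between("id", 2, 4)  # analyzed again at lookup time
assert plan_a != plan_b  # naive equality fails -> cache miss
```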
[jira] [Created] (SPARK-47527) Cache misses for queries using With expressions
Bruce Robbins created SPARK-47527:
----------------------------------
             Summary: Cache misses for queries using With expressions
                 Key: SPARK-47527
                 URL: https://issues.apache.org/jira/browse/SPARK-47527
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Bruce Robbins
[jira] [Updated] (SPARK-47527) Cache miss for queries using With expressions
[ https://issues.apache.org/jira/browse/SPARK-47527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Robbins updated SPARK-47527:
----------------------------------
    Description: minor wording edits (the full description is quoted in the resolution notice for this issue)
[jira] [Updated] (SPARK-47527) Cache miss for queries using With expressions
[ https://issues.apache.org/jira/browse/SPARK-47527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Robbins updated SPARK-47527:
----------------------------------
    Summary: Cache miss for queries using With expressions  (was: Cache misses for queries using With expressions)
[jira] [Comment Edited] (SPARK-47193) Converting dataframe to rdd results in data loss
[ https://issues.apache.org/jira/browse/SPARK-47193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821393#comment-17821393 ]

Bruce Robbins edited comment on SPARK-47193 at 2/27/24 8:48 PM:
---------------------------------------------------------------
Running this in Spark 3.5.0 in local mode on my laptop, I get
{noformat}
df count = 8
...
rdd count = 8
{noformat}
What is your environment and Spark configuration?

By the way, the "{{...}}" above are messages like
{noformat}
24/02/27 11:34:51 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: UserId, LocationId, LocationName, CreatedDate, Status
 Schema: UserId, LocationId, LocationName, Status, CreatedDate
Expected: Status but found: CreatedDate
CSV file: file:userLocation.csv
{noformat}

> Converting dataframe to rdd results in data loss
> ------------------------------------------------
>
>                 Key: SPARK-47193
>                 URL: https://issues.apache.org/jira/browse/SPARK-47193
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.5.0, 3.5.1
>            Reporter: Ivan Bova
>            Priority: Critical
>              Labels: correctness
>         Attachments: device.csv, deviceClass.csv, deviceType.csv, language.csv, location.csv, location1.csv, timeZoneLookup.csv, user.csv, userLocation.csv, userProfile.csv
>
> I have 10 csv files and need to create a mapping from them. After all of the joins, the dataframe contains all expected rows, but the rdd created from it contains only half of them.
> {code:java}
> case class MyUserProfileMessage(UserId: Int, Email: String, FirstName: String, LastName: String, LanguageId: Option[Int])
> case class MyLanguageMessage(LanguageId: Int, LanguageLocaleId: String)
> case class MyDeviceMessage(DeviceId1: String, Created: Option[Timestamp], UpdatedDate: Timestamp, DeviceId2: String, DeviceName: String, LocationId: Option[Int], DeviceTypeId: Option[Int], DeviceClassId: Int, UserId1: Option[Int])
> case class MyDeviceClassMessage(DeviceClassId: Int, DeviceClassName: String)
> case class MyDeviceTypeMessage(DeviceTypeId: Int, DeviceTypeName: String)
> case class MyLocation1(LocationId1: Int, LocationId: Int, Latitude: Option[Double], Longitude: Option[Double], Radius: Option[Double], CreatedDate: Timestamp)
> case class MyTimeZoneLookupMessage(TimeZoneId: Int, ZoneName: String)
> case class MyUserLocationMessage(UserId: Int, LocationId: Int, LocationName: String, Status: Int, CreatedDate: Timestamp)
> case class MyUserMessage(UserId: Int, Created: Option[Timestamp], Deleted: Option[Timestamp], Active: Option[Boolean], ActivatedDate: Option[Timestamp])
> case class MyLocationMessage(LocationId: Int, IsDeleted: Option[Boolean], Address1: String, Address2: String, City: String, State: String, Country: String, ZipCode: String, Feature2Enabled: Option[Boolean], LocationStatus: Option[Int], Location1Enabled: Option[Boolean], LocationKey: String, UpdatedDateTime: Timestamp, CreatedDate: Timestamp, Feature1Enabled: Option[Boolean], Level: Option[Int], TimeZone: Option[Int])
>
> val userProfile = spark.read.option("header", "true").option("comment", "#").option("nullValue", "null").schema(Encoders.product[MyUserProfileMessage].schema).csv("userProfile.csv").as[MyUserProfileMessage]
> val language = spark.read.option("header", "true").option("comment", "#").option("nullValue", "null").schema(Encoders.product[MyLanguageMessage].schema).csv("language.csv").as[MyLanguageMessage]
> val device = spark.read.option("header", "true").option("comment", "#").option("nullValue", "null").schema(Encoders.product[MyDeviceMessage].schema).csv("device.csv").as[MyDeviceMessage]
> val deviceClass = spark.read.option("header", "true").option("comment", "#").option("nullValue", "null").schema(Encoders.product[MyDeviceClassMessage].schema).csv("deviceClass.csv").as[MyDeviceClassMessage]
> val deviceType = spark.read.option("header", "true").option("comment", "#").option("nullValue", "null").schema(Encoders.product[MyDeviceTypeMessage].schema).csv("deviceType.csv").as[MyDeviceTypeMessage]
> val location1 = spark.read.option("header", "true").option("comment", "#").option("nullValue", "null").schema(Encoders.product[MyLocation1].schema).csv("location1.csv").as[MyLocation1]
> val timeZoneLookup = spark.read.option("header", "true").option("comment", "#").option("nullValue", "null").schema(Encoders.product[MyTimeZoneLookupMessage].schema).csv("timeZoneLookup.csv").as[MyTimeZoneLookupMessage]
> val userLocation = spark.read.option("header", "true").option("comment", "#").option("nullValue", "null").schema(Encoders.product[MyUserLocationMessage].schema).csv("userLocation.csv").as[MyUserLocationMessage]
> val user = spark.read.option("header", "true").option("comment", "#").option("nullValue",
[jira] [Commented] (SPARK-47193) Converting dataframe to rdd results in data loss
[ https://issues.apache.org/jira/browse/SPARK-47193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17821393#comment-17821393 ]

Bruce Robbins commented on SPARK-47193:
---------------------------------------
Running this in Spark 3.5.0 in local mode on my laptop, I get
{noformat}
df count = 8
...
rdd count = 8
{noformat}
What is your environment and Spark configuration?

By the way, the {{...}} above are messages like
{noformat}
24/02/27 11:34:51 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: UserId, LocationId, LocationName, CreatedDate, Status
 Schema: UserId, LocationId, LocationName, Status, CreatedDate
Expected: Status but found: CreatedDate
CSV file: file:userLocation.csv
{noformat}
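The CSVHeaderChecker warning quoted in the comments is worth dwelling on: the file's header lists the columns as (..., CreatedDate, Status) while the case-class schema declares (..., Status, CreatedDate). A small illustration outside Spark, using plain Python's csv module and a hypothetical one-row file, of what goes wrong when a schema is applied to such a file by position rather than by header name:

```python
import csv, io

# Not Spark code: a minimal illustration of the header-order mismatch the
# CSVHeaderChecker warning flags. If a reader applies a schema positionally
# while the file's header lists columns in a different order, values land
# in the wrong fields (here Status and CreatedDate swap).

data = "UserId,LocationId,LocationName,CreatedDate,Status\n1,10,Home,2024-01-01,2\n"
schema = ["UserId", "LocationId", "LocationName", "Status", "CreatedDate"]

rows = list(csv.reader(io.StringIO(data)))
header, values = rows[0], rows[1]

positional = dict(zip(schema, values))            # schema applied by position
by_name = {h: v for h, v in zip(header, values)}  # matched via the header

assert positional["Status"] == "2024-01-01"  # wrong: got CreatedDate's value
assert by_name["Status"] == "2"              # matches the file's intent
```

Whether this mismatch relates to the reported row loss is not established here; it is simply what the warning in the comment is pointing at.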
[jira] [Commented] (SPARK-47134) Unexpected nulls when casting decimal values in specific cases
[ https://issues.apache.org/jira/browse/SPARK-47134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819789#comment-17819789 ]

Bruce Robbins commented on SPARK-47134:
---------------------------------------
Oddly, I cannot reproduce on either 3.4.1 or 3.5.0. Also, my 3.4.1 plan doesn't look like your 3.4.1 plan: my plan uses {{sum}}, your plan uses {{decimalsum}}. I can't find where {{decimalsum}} comes from in the code base, but maybe I am not looking hard enough.
{noformat}
scala> val ds = 0.to(23386).map(x => if (x > 13878) ("A", x) else ("B", x)).toDS
ds: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]

scala> ds.createOrReplaceTempView("t")

scala> spark.sql("select CAST(SUM(1.00) AS DECIMAL(28,14)) as ct FROM t GROUP BY `_1` ORDER BY ct ASC").show()
+--------+
|      ct|
+--------+
| 9508.00|
|13879.00|
+--------+

scala> spark.sql("select CAST(SUM(1.00) AS DECIMAL(28,14)) as ct FROM t GROUP BY `_1` ORDER BY ct ASC").explain
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [ct#19 ASC NULLS FIRST], true, 0
   +- Exchange rangepartitioning(ct#19 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=68]
      +- HashAggregate(keys=[_1#2], functions=[sum(1.00)])
         +- Exchange hashpartitioning(_1#2, 200), ENSURE_REQUIREMENTS, [plan_id=65]
            +- HashAggregate(keys=[_1#2], functions=[partial_sum(1.00)])
               +- LocalTableScan [_1#2]

scala> sql("select version()").show(false)
+----------------------------------------------+
|version()                                     |
+----------------------------------------------+
|3.4.1 6b1ff22dde1ead51cbf370be6e48a802daae58b6|
+----------------------------------------------+
{noformat}

> Unexpected nulls when casting decimal values in specific cases
> --------------------------------------------------------------
>
>                 Key: SPARK-47134
>                 URL: https://issues.apache.org/jira/browse/SPARK-47134
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.1, 3.5.0
>            Reporter: Dylan Walker
>            Priority: Major
>         Attachments: 321queryplan.txt, 341queryplan.txt
>
> In specific cases, casting decimal values can result in {{null}} values where no overflow exists.
> The cases appear very specific, and I don't have the depth of knowledge to generalize this issue, so here is a simple spark-shell reproduction:
> *Setup:*
> {code:scala}
> scala> val ds = 0.to(23386).map(x => if (x > 13878) ("A", x) else ("B", x)).toDS
> ds: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]
>
> scala> ds.createOrReplaceTempView("t")
> {code}
> *Spark 3.2.1 behaviour (correct):*
> {code:scala}
> scala> spark.sql("select CAST(SUM(1.00) AS DECIMAL(28,14)) as ct FROM t GROUP BY `_1` ORDER BY ct ASC").show()
> +--------+
> |      ct|
> +--------+
> | 9508.00|
> |13879.00|
> +--------+
> {code}
> *Spark 3.4.1 / Spark 3.5.0 behaviour:*
> {code:scala}
> scala> spark.sql("select CAST(SUM(1.00) AS DECIMAL(28,14)) as ct FROM t GROUP BY `_1` ORDER BY ct ASC").show()
> +-------+
> |     ct|
> +-------+
> |   null|
> |9508.00|
> +-------+
> {code}
> This is fairly delicate:
> - removing the {{ORDER BY}} clause produces the correct result
> - removing the {{CAST}} produces the correct result
> - changing the number of 0s in the argument to {{SUM}} produces the correct result
> - setting {{spark.ansi.enabled}} to {{true}} produces the correct result (and does not throw an error)
> Also, removing the {{ORDER BY}} but writing {{ds}} to a parquet file will also result in the unexpected nulls.
> Please let me know if you need additional information.
> We are also interested in understanding whether setting {{spark.ansi.enabled}} can be considered a reliable workaround to this issue prior to a fix being released, if possible.
> Text files that include {{explain()}} output are attached.
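One way to sanity-check the reporter's "no overflow exists" claim: a DECIMAL(28,14) leaves 28 - 14 = 14 digits for the integral part, and the largest value involved here (13879.00) needs only 5 of them, so the nulls cannot be explained by cast overflow. A quick check in plain Python (the `fits` helper is ad hoc for this note, not a Spark API):

```python
from decimal import Decimal

# Rough model of SQL DECIMAL(precision, scale) capacity: the integral part
# may use at most (precision - scale) digits. This ignores rounding of the
# fractional part, which is not at issue for these whole-ish values.

def fits(value: Decimal, precision: int, scale: int) -> bool:
    integral_digits = len(str(abs(int(value))))
    return integral_digits <= precision - scale

# Both values from the repro fit comfortably in DECIMAL(28,14).
for v in (Decimal("9508.00"), Decimal("13879.00")):
    assert fits(v, precision=28, scale=14)

# For contrast, a 15-digit integral part would genuinely overflow.
assert not fits(Decimal("123456789012345"), precision=28, scale=14)
```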
[jira] [Updated] (SPARK-47104) Spark SQL query fails with NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-47104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-47104: -- Affects Version/s: 3.5.0 3.4.2 > Spark SQL query fails with NullPointerException > --- > > Key: SPARK-47104 > URL: https://issues.apache.org/jira/browse/SPARK-47104 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1, 3.4.2, 3.5.0 >Reporter: Chhavi Bansal >Priority: Major > > I am trying to run a very simple SQL query involving join and orderby clause > and then using UUID() function in the outermost select stmt. The query fails > {code:java} > val df = spark.read.format("csv").option("header", > "true").load("src/main/resources/titanic.csv") > df.createOrReplaceTempView("titanic") > val query = spark.sql(" select name, uuid() as _iid from (select s.name from > titanic s join titanic t on s.name = t.name order by name) ;") > query.show() // FAILS{code} > Dataset is a normal csv file with the following columns > {code:java} > PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked > {code} > Below is the error > {code:java} > Exception in thread "main" java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$2(limit.scala:207) > at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237) > at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at scala.collection.TraversableLike.map(TraversableLike.scala:237) > at scala.collection.TraversableLike.map$(TraversableLike.scala:230) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) > at > 
org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:366) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338) > at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715) > at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704) > at org.apache.spark.sql.Dataset.head(Dataset.scala:2728) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2935) > at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:326) > at org.apache.spark.sql.Dataset.show(Dataset.scala:808) > at org.apache.spark.sql.Dataset.show(Dataset.scala:785) > at > hyperspace2.sparkPlan$.delayedEndpoint$hyperspace2$sparkPlan$1(sparkPlan.scala:14) > at hyperspace2.sparkPlan$delayedInit$body.apply(sparkPlan.scala:6) > at scala.Function0.apply$mcV$sp(Function0.scala:39) > at scala.Function0.apply$mcV$sp$(Function0.scala:39) > at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17) > at scala.App.$anonfun$main$1$adapted(App.scala:80) > at 
scala.collection.immutable.List.foreach(List.scala:392)
> at scala.App.main(App.scala:80)
> at scala.App.main$(App.scala:78)
> at hyperspace2.sparkPlan$.main(sparkPlan.scala:6)
> at hyperspace2.sparkPlan.main(sparkPlan.scala) {code}
> Note:
> # If I remove the order by clause, the query produces the correct output.
> # This happens when I read the dataset from a csv file; it works fine if I build the dataframe using Seq().toDF.
> # The query fails if I use spark.sql("query").show(), but succeeds when I simply write the result to a csv file.
> [https://stackoverflow.com/questions/78020267/spark-sql-query-fails-with-nullpointerexception]
> Please can someone look into why this happens only when using `show()`, since this is failing queries in production for me.
[jira] [Commented] (SPARK-47104) Spark SQL query fails with NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-47104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17818934#comment-17818934 ] Bruce Robbins commented on SPARK-47104: --- It's not a CSV specific issue. You can reproduce with a cached view. The following fails on the master branch, when using {{spark-sql}}: {noformat} create or replace temp view v1(id, name) as values (1, "fred"), (2, "bob"); cache table v1; select name, uuid() as _iid from ( select s.name from v1 s join v1 t on s.name = t.name order by name ) limit 20; {noformat} The exception is: {noformat} java.lang.NullPointerException: Cannot invoke "org.apache.spark.sql.catalyst.util.RandomUUIDGenerator.getNextUUIDUTF8String()" because "this.randomGen_0" is null at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$6(limit.scala:297) at scala.collection.ArrayOps$.map$extension(ArrayOps.scala:934) at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$1(limit.scala:297) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:246) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:243) at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:286) at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:390) at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:418) at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:390) {noformat} It seems that non-deterministic expressions are not getting initialized before being used in the unsafe projection. I can take a look. 
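For context on that last point: {{uuid()}} is a {{Nondeterministic}} catalyst expression, and such expressions must have {{initialize(partitionIndex)}} called before their first {{eval}}. A simplified illustration of that contract follows; {{Uuid}} and {{Nondeterministic.initialize}} are real catalyst APIs, but the driver-side usage here is a hypothetical sketch (catalyst internals are not a stable public interface), not the actual projection code:
{code:scala}
import org.apache.spark.sql.catalyst.expressions.Uuid

// Nondeterministic expressions carry per-partition state; for Uuid that is
// a RandomUUIDGenerator. Projections are expected to call initialize()
// once per partition before evaluating the expression.
val uuid = Uuid(randomSeed = Some(42L))
uuid.initialize(partitionIndex = 0) // skipping this is what produces the NPE above
val value = uuid.eval(null)         // safe only after initialize()
{code}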
> Spark SQL query fails with NullPointerException > --- > > Key: SPARK-47104 > URL: https://issues.apache.org/jira/browse/SPARK-47104 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 >Reporter: Chhavi Bansal >Priority: Major > > I am trying to run a very simple SQL query involving join and orderby clause > and then using UUID() function in the outermost select stmt. The query fails > {code:java} > val df = spark.read.format("csv").option("header", > "true").load("src/main/resources/titanic.csv") > df.createOrReplaceTempView("titanic") > val query = spark.sql(" select name, uuid() as _iid from (select s.name from > titanic s join titanic t on s.name = t.name order by name) ;") > query.show() // FAILS{code} > Dataset is a normal csv file with the following columns > {code:java} > PassengerId,Survived,Pclass,Name,Gender,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked > {code} > Below is the error > {code:java} > Exception in thread "main" java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.TakeOrderedAndProjectExec.$anonfun$executeCollect$2(limit.scala:207) > at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237) > at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at scala.collection.TraversableLike.map(TraversableLike.scala:237) > at scala.collection.TraversableLike.map$(TraversableLike.scala:230) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) > at > org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:207) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$executeCollect$1(AdaptiveSparkPlanExec.scala:338) > at > 
org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:366) > at > org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:338) > at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715) > at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at >
[jira] [Commented] (SPARK-47034) join between cached temp tables result in missing entries
[ https://issues.apache.org/jira/browse/SPARK-47034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817123#comment-17817123 ] Bruce Robbins commented on SPARK-47034: --- I wonder if this is SPARK-45592 (and, relatedly, SPARK-45282), which existed as a bug in 3.5.0 but is fixed on master and branch-3.5.
> join between cached temp tables result in missing entries
> -
>
> Key: SPARK-47034
> URL: https://issues.apache.org/jira/browse/SPARK-47034
> Project: Spark
> Issue Type: Bug
> Components: Examples
> Affects Versions: 3.5.0
> Reporter: shurik mermelshtein
> Priority: Major
>
> We create several temp tables (views) by loading several delta tables and joining them.
> Those views are used to calculate different metrics; each metric requires different views. Some of the more popular views are cached for better performance.
> We have noticed that once we upgraded from Spark 3.4.2 to Spark 3.5.0, some of the joins started to fail.
> We can reproduce a case where we have 2 data frames (views) (these are not the real names/values we use; this is just an example):
> # users, with the columns user_id, campaign_id, user_name. We make sure it has a single entry:
> '11', '2', 'Jhon Doe'
> # actions, with the columns user_id, campaign_id, action_id, action count. We make sure it has a single entry:
> '11', '2', 'clicks', 5
>
> # The users view can be filtered for user_id = '11' or/and campaign_id = '2' and it will find the existing single row.
> # The actions view can be filtered for user_id = '11' or/and campaign_id = '2' and it will find the existing single row.
> # users and actions can be inner joined on user_id *OR* campaign_id and the join is successful.
> # users and actions can *not* be inner joined on user_id *AND* campaign_id. The join results in no entries.
> # If we write both of the views to S3 and read them back into new data frames, the join suddenly works.
> # If we disable AQE, the join works.
> # Running checkpoint on the views does not make join #4 work.
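Given the observation above that disabling AQE makes the join work, a hedged sketch of that mitigation for SPARK-47034 follows. The DataFrame names mirror the reporter's illustrative views and are not real code; the underlying bug is believed fixed by SPARK-45592 on master and branch-3.5:
{code:scala}
// Mitigation sketch only: turn off adaptive query execution for this session.
spark.conf.set("spark.sql.adaptive.enabled", "false")

// With AQE off, the two-key inner join reportedly returns the expected row.
// "users" and "actions" stand in for the cached views described above.
val joined = users.join(actions, Seq("user_id", "campaign_id"), "inner")
{code}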
[jira] [Commented] (SPARK-47019) AQE dynamic cache partitioning causes SortMergeJoin to result in data loss
[ https://issues.apache.org/jira/browse/SPARK-47019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816321#comment-17816321 ] Bruce Robbins commented on SPARK-47019: --- I can reproduce on my laptop using Spark 3.5.0 and {{--master "local-cluster[3,1,1024]"}}. However, I can not reproduce on the latest branch-3.5 or master. So it seems to have been fixed, probably by SPARK-45592. > AQE dynamic cache partitioning causes SortMergeJoin to result in data loss > -- > > Key: SPARK-47019 > URL: https://issues.apache.org/jira/browse/SPARK-47019 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core >Affects Versions: 3.5.0 > Environment: Tested in 3.5.0 > Reproduced on, so far: > * kubernetes deployment > * docker cluster deployment > Local Cluster: > * master > * worker1 (2/2G) > * worker2 (1/1G) >Reporter: Ridvan Appa Bugis >Priority: Blocker > Labels: DAG, caching, correctness, data-loss, > dynamic_allocation, inconsistency, partitioning > Attachments: Screenshot 2024-02-07 at 20.09.44.png, Screenshot > 2024-02-07 at 20.10.07.png, eventLogs-app-20240207175940-0023.zip, > testdata.zip > > > It seems like we have encountered an issue with Spark AQE's dynamic cache > partitioning which causes incorrect *count* output values and data loss. > A similar issue could not be found, so i am creating this ticket to raise > awareness. > > Preconditions: > - Setup a cluster as per environment specification > - Prepare test data (or a data large enough to trigger read by both > executors) > Steps to reproduce: > - Read parent > - Self join parent > - cache + materialize parent > - Join parent with child > > Performing a self-join over a parentDF, then caching + materialising the DF, > and then joining it with a childDF results in *incorrect* count value and > {*}missing data{*}. > > Performing a *repartition* seems to fix the issue, most probably due to > rearrangement of the underlying partitions and statistic update. 
>
> This behaviour is observed on a multi-worker cluster with a job running 2 executors (1 per worker), when the data file is large enough to be read by both executors.
> Not reproducible in local mode.
>
> Circumvention:
> So far, this can be alleviated by disabling _spark.sql.optimizer.canChangeCachedPlanOutputPartitioning_ or by performing a repartition, but neither fixes the root cause.
>
> This issue is dangerous considering that the data loss occurs silently and, in the absence of proper checks, can lead to wrong behaviour/results down the line. So we have labeled it as a blocker.
>
> There seems to be a file-size threshold after which the data loss is observed (possibly implying that it happens when both executors start reading the data file).
>
> Minimal example:
> {code:java}
> // Read parent
> val parentData = session.read.format("avro").load("/data/shared/test/parent")
> // Self join parent and cache + materialize
> val parent = parentData.join(parentData, Seq("PID")).cache()
> parent.count()
> // Read child
> val child = session.read.format("avro").load("/data/shared/test/child")
> // Basic join
> val resultBasic = child.join(
>   parent,
>   parent("PID") === child("PARENT_ID")
> )
> // Count: 16479 (Wrong)
> println(s"Count no repartition: ${resultBasic.count()}")
> // Repartition parent join
> val resultRepartition = child.join(
>   parent.repartition(),
>   parent("PID") === child("PARENT_ID")
> )
> // Count: 50094 (Correct)
> println(s"Count with repartition: ${resultRepartition.count()}")
> {code}
>
> Invalid count-only DAG:
> !Screenshot 2024-02-07 at 20.10.07.png|width=519,height=853!
> Valid repartition DAG:
> !Screenshot 2024-02-07 at 20.09.44.png|width=368,height=1219!
> > Spark submit for this job: > {code:java} > spark-submit > --class ExampleApp > --packages org.apache.spark:spark-avro_2.12:3.5.0 > --deploy-mode cluster > --master spark://spark-master:6066 > --conf spark.sql.autoBroadcastJoinThreshold=-1 > --conf spark.cores.max=3 > --driver-cores 1 > --driver-memory 1g > --executor-cores 1 > --executor-memory 1g > /path/to/test.jar > {code} > The cluster should be setup to the following (worker1(m+e) worker2(e)) as to > split the executors onto two workers. > I have prepared a simple github repository which contains the compilable > above example. > [https://github.com/ridvanappabugis/spark-3.5-issue] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
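The circumvention described in SPARK-47019 above can also be expressed as an extra flag on the same spark-submit invocation. This is a sketch only: disabling {{spark.sql.optimizer.canChangeCachedPlanOutputPartitioning}} trades away the AQE cached-plan partitioning optimization for correctness until the fix (likely SPARK-45592) is picked up:
{noformat}
spark-submit
 --class ExampleApp
 --conf spark.sql.optimizer.canChangeCachedPlanOutputPartitioning=false
 ... (remaining flags as in the spark-submit above) ...
 /path/to/test.jar
{noformat}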
[jira] [Updated] (SPARK-46779) Grouping by subquery with a cached relation can fail
[ https://issues.apache.org/jira/browse/SPARK-46779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-46779: -- Description: Example: {noformat} create or replace temp view data(c1, c2) as values (1, 2), (1, 3), (3, 7), (4, 5); cache table data; select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from data d2 group by all; {noformat} It fails with the following error: {noformat} [INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000 org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000 {noformat} If you don't cache the view, the query succeeds. Note, in 3.4.2 and 3.5.0 the issue happens only with cached tables, not cached views. I think that's because cached views were not getting properly deduplicated in those versions. was: Example: {noformat} create or replace temp view data(c1, c2) as values (1, 2), (1, 3), (3, 7), (4, 5); cache table data; select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from data d2 group by all; {noformat} It fails with the following error: {noformat} [INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000 org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000 {noformat} If you don't cache the view, the query succeeds. 
> Grouping by subquery with a cached relation can fail > > > Key: SPARK-46779 > URL: https://issues.apache.org/jira/browse/SPARK-46779 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.2, 3.5.0, 4.0.0 >Reporter: Bruce Robbins >Priority: Major > Labels: pull-request-available > > Example: > {noformat} > create or replace temp view data(c1, c2) as values > (1, 2), > (1, 3), > (3, 7), > (4, 5); > cache table data; > select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from > data d2 group by all; > {noformat} > It fails with the following error: > {noformat} > [INTERNAL_ERROR] Couldn't find count(1)#163L in > [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000 > org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L > in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000 > {noformat} > If you don't cache the view, the query succeeds. > Note, in 3.4.2 and 3.5.0 the issue happens only with cached tables, not > cached views. I think that's because cached views were not getting properly > deduplicated in those versions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
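Since the SPARK-46779 description above notes the query succeeds when the relation is not cached, a minimal workaround sketch (spark-sql, assuming the {{data}} view from the example) is simply to drop the cache entry before running the query:
{noformat}
uncache table data;

select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2)
from data d2 group by all;
{noformat}
This of course forfeits the caching benefit; it only sidesteps the internal error until a fix lands.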
[jira] [Updated] (SPARK-46779) Grouping by subquery with a cached relation can fail
[ https://issues.apache.org/jira/browse/SPARK-46779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-46779: -- Affects Version/s: 3.5.0 3.4.2 > Grouping by subquery with a cached relation can fail > > > Key: SPARK-46779 > URL: https://issues.apache.org/jira/browse/SPARK-46779 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.2, 3.5.0, 4.0.0 >Reporter: Bruce Robbins >Priority: Major > Labels: pull-request-available > > Example: > {noformat} > create or replace temp view data(c1, c2) as values > (1, 2), > (1, 3), > (3, 7), > (4, 5); > cache table data; > select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from > data d2 group by all; > {noformat} > It fails with the following error: > {noformat} > [INTERNAL_ERROR] Couldn't find count(1)#163L in > [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000 > org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L > in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000 > {noformat} > If you don't cache the view, the query succeeds. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46779) Grouping by subquery with a cached relation can fail
Bruce Robbins created SPARK-46779: - Summary: Grouping by subquery with a cached relation can fail Key: SPARK-46779 URL: https://issues.apache.org/jira/browse/SPARK-46779 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Bruce Robbins Example: {noformat} create or replace temp view data(c1, c2) as values (1, 2), (1, 3), (3, 7), (4, 5); cache table data; select c1, (select count(*) from data d1 where d1.c1 = d2.c1), count(c2) from data d2 group by all; {noformat} It fails with the following error: {noformat} [INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000 org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find count(1)#163L in [c1#78,_groupingexpression#149L,count(1)#82L] SQLSTATE: XX000 {noformat} If you don't cache the view, the query succeeds. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46373) Create DataFrame Bug
[ https://issues.apache.org/jira/browse/SPARK-46373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796385#comment-17796385 ] Bruce Robbins commented on SPARK-46373: --- Maybe due to this (from [the docs|https://spark.apache.org/docs/3.5.0/]): {quote}Spark runs on Java 8/11/17, Scala 2.12/2.13, Python 3.8+, and R 3.5+.{quote} Scala 3 is not listed as a supported version. > Create DataFrame Bug > > > Key: SPARK-46373 > URL: https://issues.apache.org/jira/browse/SPARK-46373 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Bleibtreu >Priority: Major > > Scala version is 3.3.1 > Spark version is 3.5.0 > I am using spark-core 3.5.1. I am trying to create a DataFrame through the > reflection api, but "No TypeTag available for Person" will appear. I have > tried for a long time, but I still don't quite understand why TypeTag cannot > recognize my Person case class. > {code:java} > import sparkSession.implicits._ > import scala.reflect.runtime.universe._ > case class Person(name: String) > val a = List(Person("A"), Person("B"), Person("C")) > val df = sparkSession.createDataFrame(a) > df.show(){code} > !https://media.discordapp.net/attachments/839723072239566878/1183747749204725821/image.png?ex=65897600=65770100=4eeba8d8499499439590a34260f8b441c6594c572c545f5f61f8dc65beeb6a4b&==webp=lossless=1178=142! > I tested it and it is indeed a problem unique to Scala3 > There is no problem on Scala2.13 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
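A hedged workaround sketch for the Scala 3 case above: bypass the {{TypeTag}}-based encoder derivation entirely by supplying rows plus an explicit schema. {{createDataFrame(RDD[Row], StructType)}} is a standard Spark API; this is untested here under Scala 3.3.1:
{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

case class Person(name: String)

val people = List(Person("A"), Person("B"), Person("C"))
val schema = StructType(Seq(StructField("name", StringType, nullable = false)))
// No TypeTag needed: rows plus an explicit schema.
val df = sparkSession.createDataFrame(
  sparkSession.sparkContext.parallelize(people.map(p => Row(p.name))),
  schema)
df.show()
{code}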
[jira] [Updated] (SPARK-46289) Exception when ordering by UDT in interpreted mode
[ https://issues.apache.org/jira/browse/SPARK-46289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-46289: -- Priority: Minor (was: Major) > Exception when ordering by UDT in interpreted mode > -- > > Key: SPARK-46289 > URL: https://issues.apache.org/jira/browse/SPARK-46289 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.3, 3.4.2, 3.5.0 >Reporter: Bruce Robbins >Priority: Minor > > In interpreted mode, ordering by a UDT will result in an exception. For > example: > {noformat} > import org.apache.spark.ml.linalg.{DenseVector, Vector} > val df = Seq.tabulate(30) { x => > (x, x + 1, x + 2, new DenseVector(Array((x/100.0).toDouble, ((x + > 1)/100.0).toDouble, ((x + 3)/100.0).toDouble))) > }.toDF("id", "c1", "c2", "c3") > df.createOrReplaceTempView("df") > // this works > sql("select * from df order by c3").collect > sql("set spark.sql.codegen.wholeStage=false") > sql("set spark.sql.codegen.factoryMode=NO_CODEGEN") > // this gets an error > sql("select * from df order by c3").collect > {noformat} > The second {{collect}} action results in the following exception: > {noformat} > org.apache.spark.SparkIllegalArgumentException: Type > UninitializedPhysicalType does not support ordered operations. 
> at > org.apache.spark.sql.errors.QueryExecutionErrors$.orderedOperationUnsupportedByDataTypeError(QueryExecutionErrors.scala:348) > at > org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:332) > at > org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:329) > at > org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:60) > at > org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:39) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter$RowComparator.compare(UnsafeExternalRowSorter.java:254) > {noformat} > Note: You don't get an error if you use {{show}} rather than {{collect}}. > This is because {{show}} will implicitly add a {{limit}}, in which case the > ordering is performed by {{TakeOrderedAndProject}} rather than > {{UnsafeExternalRowSorter}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46289) Exception when ordering by UDT in interpreted mode
[ https://issues.apache.org/jira/browse/SPARK-46289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-46289: -- Affects Version/s: 3.3.3 > Exception when ordering by UDT in interpreted mode > -- > > Key: SPARK-46289 > URL: https://issues.apache.org/jira/browse/SPARK-46289 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.3, 3.4.2, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > In interpreted mode, ordering by a UDT will result in an exception. For > example: > {noformat} > import org.apache.spark.ml.linalg.{DenseVector, Vector} > val df = Seq.tabulate(30) { x => > (x, x + 1, x + 2, new DenseVector(Array((x/100.0).toDouble, ((x + > 1)/100.0).toDouble, ((x + 3)/100.0).toDouble))) > }.toDF("id", "c1", "c2", "c3") > df.createOrReplaceTempView("df") > // this works > sql("select * from df order by c3").collect > sql("set spark.sql.codegen.wholeStage=false") > sql("set spark.sql.codegen.factoryMode=NO_CODEGEN") > // this gets an error > sql("select * from df order by c3").collect > {noformat} > The second {{collect}} action results in the following exception: > {noformat} > org.apache.spark.SparkIllegalArgumentException: Type > UninitializedPhysicalType does not support ordered operations. 
> at > org.apache.spark.sql.errors.QueryExecutionErrors$.orderedOperationUnsupportedByDataTypeError(QueryExecutionErrors.scala:348) > at > org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:332) > at > org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:329) > at > org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:60) > at > org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:39) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter$RowComparator.compare(UnsafeExternalRowSorter.java:254) > {noformat} > Note: You don't get an error if you use {{show}} rather than {{collect}}. > This is because {{show}} will implicitly add a {{limit}}, in which case the > ordering is performed by {{TakeOrderedAndProject}} rather than > {{UnsafeExternalRowSorter}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46289) Exception when ordering by UDT in interpreted mode
Bruce Robbins created SPARK-46289: - Summary: Exception when ordering by UDT in interpreted mode Key: SPARK-46289 URL: https://issues.apache.org/jira/browse/SPARK-46289 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0, 3.4.2 Reporter: Bruce Robbins In interpreted mode, ordering by a UDT will result in an exception. For example: {noformat} import org.apache.spark.ml.linalg.{DenseVector, Vector} val df = Seq.tabulate(30) { x => (x, x + 1, x + 2, new DenseVector(Array((x/100.0).toDouble, ((x + 1)/100.0).toDouble, ((x + 3)/100.0).toDouble))) }.toDF("id", "c1", "c2", "c3") df.createOrReplaceTempView("df") // this works sql("select * from df order by c3").collect sql("set spark.sql.codegen.wholeStage=false") sql("set spark.sql.codegen.factoryMode=NO_CODEGEN") // this gets an error sql("select * from df order by c3").collect {noformat} The second {{collect}} action results in the following exception: {noformat} org.apache.spark.SparkIllegalArgumentException: Type UninitializedPhysicalType does not support ordered operations. at org.apache.spark.sql.errors.QueryExecutionErrors$.orderedOperationUnsupportedByDataTypeError(QueryExecutionErrors.scala:348) at org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:332) at org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:329) at org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:60) at org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:39) at org.apache.spark.sql.execution.UnsafeExternalRowSorter$RowComparator.compare(UnsafeExternalRowSorter.java:254) {noformat} Note: You don't get an error if you use {{show}} rather than {{collect}}. This is because {{show}} will implicitly add a {{limit}}, in which case the ordering is performed by {{TakeOrderedAndProject}} rather than {{UnsafeExternalRowSorter}}. 
[jira] [Commented] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "scala.Some is not a valid external type for schema of array"
[ https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792942#comment-17792942 ] Bruce Robbins commented on SPARK-45644: --- Even though this is the original issue, I closed it as a duplicate because the fix was applied under SPARK-45896. > After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException > "scala.Some is not a valid external type for schema of array" > -- > > Key: SPARK-45644 > URL: https://issues.apache.org/jira/browse/SPARK-45644 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.4.1, 3.5.0 >Reporter: Adi Wehrli >Priority: Major > > I do not really know if this is a bug, but I am at the end with my knowledge. > A Spark job ran successfully with Spark 3.2.x and 3.3.x. > But after upgrading to 3.4.1 (as well as with 3.5.0) running the same job > with the same data the following always occurs now: > {code} > scala.Some is not a valid external type for schema of array > {code} > The corresponding stacktrace is: > {code} > 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor > msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch > worker for task 0.0 in stage 0.0 (TID 0)" > java.lang.RuntimeException: scala.Some is not a valid external type for > schema of array > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) ~[?:?] 
> at > org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165) > ~[spark-sql_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380) > ~[spark-sql_2.12-3.5.0.jar:3.5.0] > at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) > ~[scala-library-2.12.15.jar:?] > at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) > ~[scala-library-2.12.15.jar:?] > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at org.apache.spark.scheduler.Task.run(Task.scala:141) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) > ~[spark-common-utils_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) > ~[spark-common-utils_2.12-3.5.0.jar:3.5.0] > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) > [spark-core_2.12-3.5.0.jar:3.5.0] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > [?:?] 
> at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > [?:?] > at java.lang.Thread.run(Thread.java:834) [?:?] > 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor > msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch > worker for task 1.0 in stage 0.0 (TID 1)" > java.lang.RuntimeException: scala.Some is not a valid external type for > schema of array > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown > Source) ~[?:?] > at >
[jira] [Resolved] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "scala.Some is not a valid external type for schema of array"
[ https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins resolved SPARK-45644. --- Resolution: Duplicate > After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException > "scala.Some is not a valid external type for schema of array" > -- > > Key: SPARK-45644 > URL: https://issues.apache.org/jira/browse/SPARK-45644 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.4.1, 3.5.0 >Reporter: Adi Wehrli >Priority: Major > > I do not really know if this is a bug, but I am at the end with my knowledge. > A Spark job ran successfully with Spark 3.2.x and 3.3.x. > But after upgrading to 3.4.1 (as well as with 3.5.0) running the same job > with the same data the following always occurs now: > {code} > scala.Some is not a valid external type for schema of array > {code} > The corresponding stacktrace is: > {code} > 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor > msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch > worker for task 0.0 in stage 0.0 (TID 0)" > java.lang.RuntimeException: scala.Some is not a valid external type for > schema of array > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) ~[?:?] 
> at > org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165) > ~[spark-sql_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380) > ~[spark-sql_2.12-3.5.0.jar:3.5.0] > at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) > ~[scala-library-2.12.15.jar:?] > at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) > ~[scala-library-2.12.15.jar:?] > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at org.apache.spark.scheduler.Task.run(Task.scala:141) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) > ~[spark-common-utils_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) > ~[spark-common-utils_2.12-3.5.0.jar:3.5.0] > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) > [spark-core_2.12-3.5.0.jar:3.5.0] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > [?:?] 
> at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > [?:?] > at java.lang.Thread.run(Thread.java:834) [?:?] > 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor > msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch > worker for task 1.0 in stage 0.0 (TID 1)" > java.lang.RuntimeException: scala.Some is not a valid external type for > schema of array > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown > Source) ~[?:?] > at >
[jira] [Updated] (SPARK-46189) Various Pandas functions fail in interpreted mode
[ https://issues.apache.org/jira/browse/SPARK-46189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-46189: -- Description: Various Pandas functions ({{kurt}}, {{var}}, {{skew}}, {{cov}}, and {{stddev}}) fail with an unboxing-related exception when run in interpreted mode. Here are some reproduction cases for pyspark interactive mode: {noformat} spark.sql("set spark.sql.codegen.wholeStage=false") spark.sql("set spark.sql.codegen.factoryMode=NO_CODEGEN") import numpy as np import pandas as pd import pyspark.pandas as ps pser = pd.Series([1, 2, 3, 7, 9, 8], index=np.random.rand(6), name="a") psser = ps.from_pandas(pser) # each of the following actions gets an unboxing error psser.kurt() psser.var() psser.skew() # set up for covariance test pdf = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)], columns=["a", "b"]) psdf = ps.from_pandas(pdf) # this gets an unboxing error psdf.cov() # set up for stddev test from pyspark.pandas.spark import functions as SF from pyspark.sql.functions import col from pyspark.sql import Row df = spark.createDataFrame([Row(a=1), Row(a=2), Row(a=3), Row(a=7), Row(a=9), Row(a=8)]) # this gets an unboxing error df.select(SF.stddev(col("a"), 1)).collect() {noformat} Exception from the first case ({{psser.kurt()}}) is {noformat} java.lang.ClassCastException: class java.lang.Integer cannot be cast to class java.lang.Double (java.lang.Integer and java.lang.Double are in module java.base of loader 'bootstrap') at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:112) at org.apache.spark.sql.catalyst.types.PhysicalDoubleType$$anonfun$2.compare(PhysicalDataType.scala:184) at scala.math.Ordering.lt(Ordering.scala:98) at scala.math.Ordering.lt$(Ordering.scala:98) at org.apache.spark.sql.catalyst.types.PhysicalDoubleType$$anonfun$2.lt(PhysicalDataType.scala:184) at org.apache.spark.sql.catalyst.expressions.LessThan.nullSafeEval(predicates.scala:1196) {noformat} was: Various Pandas functions ({{kurt}}, 
{{var}}, {{skew}}, {{cov}}, and {{stddev}}) fail with an unboxing-related exception when run in interpreted mode. Here are some reproduction cases for pyspark interactive mode: {noformat} sql("set spark.sql.codegen.wholeStage=false") spark.sql("set spark.sql.codegen.factoryMode=NO_CODEGEN") import numpy as np import pandas as pd import pyspark.pandas as ps pser = pd.Series([1, 2, 3, 7, 9, 8], index=np.random.rand(6), name="a") psser = ps.from_pandas(pser) # each of the following actions gets an unboxing error psser.kurt() psser.var() psser.skew() # set up for covariance test pdf = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)], columns=["a", "b"]) psdf = ps.from_pandas(pdf) # this gets an unboxing error psdf.cov() # set up for stddev resr from pyspark.pandas.spark import functions as SF from pyspark.sql.functions import col from pyspark.sql import Row df = spark.createDataFrame([Row(a=1), Row(a=2), Row(a=3), Row(a=7), Row(a=9), Row(a=8)]) # this gets an unboxing error df.select(SF.stddev(col("a"), 1)).collect() {noformat} Exception from the first case ({{psser.kurt()}}) is {noformat} java.lang.ClassCastException: class java.lang.Integer cannot be cast to class java.lang.Double (java.lang.Integer and java.lang.Double are in module java.base of loader 'bootstrap') at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:112) at org.apache.spark.sql.catalyst.types.PhysicalDoubleType$$anonfun$2.compare(PhysicalDataType.scala:184) at scala.math.Ordering.lt(Ordering.scala:98) at scala.math.Ordering.lt$(Ordering.scala:98) at org.apache.spark.sql.catalyst.types.PhysicalDoubleType$$anonfun$2.lt(PhysicalDataType.scala:184) at org.apache.spark.sql.catalyst.expressions.LessThan.nullSafeEval(predicates.scala:1196) {noformat} > Various Pandas functions fail in interpreted mode > - > > Key: SPARK-46189 > URL: https://issues.apache.org/jira/browse/SPARK-46189 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark, SQL >Affects Versions: 3.4.1, 3.5.0 >Reporter: 
Bruce Robbins >Priority: Major > > Various Pandas functions ({{kurt}}, {{var}}, {{skew}}, {{cov}}, and > {{stddev}}) fail with an unboxing-related exception when run in interpreted > mode. > Here are some reproduction cases for pyspark interactive mode: > {noformat} > spark.sql("set spark.sql.codegen.wholeStage=false") > spark.sql("set spark.sql.codegen.factoryMode=NO_CODEGEN") > import numpy as np > import pandas as pd > import pyspark.pandas as ps > pser = pd.Series([1, 2, 3, 7, 9, 8], index=np.random.rand(6), name="a") > psser = ps.from_pandas(pser) > # each of the following actions gets an unboxing error > psser.kurt() > psser.var() > psser.skew() > # set up for
[jira] [Created] (SPARK-46189) Various Pandas functions fail in interpreted mode
Bruce Robbins created SPARK-46189: - Summary: Various Pandas functions fail in interpreted mode Key: SPARK-46189 URL: https://issues.apache.org/jira/browse/SPARK-46189 Project: Spark Issue Type: Bug Components: Pandas API on Spark, SQL Affects Versions: 3.5.0, 3.4.1 Reporter: Bruce Robbins Various Pandas functions ({{kurt}}, {{var}}, {{skew}}, {{cov}}, and {{stddev}}) fail with an unboxing-related exception when run in interpreted mode. Here are some reproduction cases for pyspark interactive mode: {noformat} spark.sql("set spark.sql.codegen.wholeStage=false") spark.sql("set spark.sql.codegen.factoryMode=NO_CODEGEN") import numpy as np import pandas as pd import pyspark.pandas as ps pser = pd.Series([1, 2, 3, 7, 9, 8], index=np.random.rand(6), name="a") psser = ps.from_pandas(pser) # each of the following actions gets an unboxing error psser.kurt() psser.var() psser.skew() # set up for covariance test pdf = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)], columns=["a", "b"]) psdf = ps.from_pandas(pdf) # this gets an unboxing error psdf.cov() # set up for stddev test from pyspark.pandas.spark import functions as SF from pyspark.sql.functions import col from pyspark.sql import Row df = spark.createDataFrame([Row(a=1), Row(a=2), Row(a=3), Row(a=7), Row(a=9), Row(a=8)]) # this gets an unboxing error df.select(SF.stddev(col("a"), 1)).collect() {noformat} Exception from the first case ({{psser.kurt()}}) is {noformat} java.lang.ClassCastException: class java.lang.Integer cannot be cast to class java.lang.Double (java.lang.Integer and java.lang.Double are in module java.base of loader 'bootstrap') at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:112) at org.apache.spark.sql.catalyst.types.PhysicalDoubleType$$anonfun$2.compare(PhysicalDataType.scala:184) at scala.math.Ordering.lt(Ordering.scala:98) at scala.math.Ordering.lt$(Ordering.scala:98) at org.apache.spark.sql.catalyst.types.PhysicalDoubleType$$anonfun$2.lt(PhysicalDataType.scala:184) at 
org.apache.spark.sql.catalyst.expressions.LessThan.nullSafeEval(predicates.scala:1196) {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
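[editor's note] The `ClassCastException` in SPARK-46189 comes from `PhysicalDoubleType`'s `Ordering[Double]` trying to unbox a `java.lang.Integer` as a double: codegen normally inserts the widening cast, while the interpreted path (whole-stage codegen off, `factoryMode=NO_CODEGEN`) did not. A minimal pure-Python sketch of that failure shape, with hypothetical names and no Spark required:

```python
# Hypothetical sketch of the SPARK-46189 failure mode: a comparator
# specialized for doubles (like PhysicalDoubleType's Ordering) receives an
# operand that was never widened to double, and fails at "unboxing" time.
def double_ordering_lt(a, b):
    # Insist both operands already have the exact runtime type float; no
    # implicit int -> float widening happens here, mirroring the checked
    # cast in BoxesRunTime.unboxToDouble on the JVM.
    if type(a) is not float or type(b) is not float:
        raise TypeError(f"{type(a).__name__} cannot be compared as double")
    return a < b

assert double_ordering_lt(1.0, 2.0)

# In interpreted mode the int reaches the ordering un-widened:
try:
    double_ordering_lt(1, 2.0)
except TypeError as exc:
    print(f"unboxing-style failure: {exc}")
```

This is only an analogy for the report above, not Spark's actual code path; the real fix belongs in the interpreted expression evaluation, which must widen the operand before handing it to the type-specialized ordering.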
[jira] [Commented] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]
[ https://issues.apache.org/jira/browse/SPARK-45896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17785234#comment-17785234 ] Bruce Robbins commented on SPARK-45896: --- I think I have a handle on this and will make a PR shortly. > Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal] > -- > > Key: SPARK-45896 > URL: https://issues.apache.org/jira/browse/SPARK-45896 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > The following action fails on 3.4.1, 3.5.0, and master: > {noformat} > scala> val df = Seq(Seq(Some(Seq(0.toDF("a") > val df = Seq(Seq(Some(Seq(0.toDF("a") > org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed > to encode a value of the expressions: mapobjects(lambdavariable(MapObject, > ObjectType(class java.lang.Object), true, -1), > mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), > true, -2), assertnotnull(validateexternaltype(lambdavariable(MapObject, > ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), > unwrapoption(ObjectType(interface scala.collection.immutable.Seq), > validateexternaltype(lambdavariable(MapObject, ObjectType(class > java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class > scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) > AS value#0 to a row. SQLSTATE: 42846 > ... > Caused by: java.lang.RuntimeException: scala.Some is not a valid external > type for schema of array > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown > Source) > ... 
> {noformat} > However, it succeeds on 3.3.3: > {noformat} > scala> val df = Seq(Seq(Some(Seq(0.toDF("a") > df: org.apache.spark.sql.DataFrame = [a: array>] > scala> df.collect > res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))]) > {noformat} > Map of Option[Seq] also fails on 3.4.1, 3.5.0, and master: > {noformat} > scala> val df = Seq(Map(0 -> Some(Seq(0.toDF("a") > val df = Seq(Map(0 -> Some(Seq(0.toDF("a") > org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed > to encode a value of the expressions: > externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, > ObjectType(class java.lang.Object), false, -1), > assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, > ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), > lambdavariable(ExternalMapToCatalyst_value, ObjectType(class > java.lang.Object), true, -2), mapobjects(lambdavariable(MapObject, > ObjectType(class java.lang.Object), true, -3), > assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class > java.lang.Object), true, -3), IntegerType, IntegerType)), > unwrapoption(ObjectType(interface scala.collection.immutable.Seq), > validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, > ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), > ObjectType(class scala.Option))), None), input[0, > scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846 > ... > Caused by: java.lang.RuntimeException: scala.Some is not a valid external > type for schema of array > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown > Source) > ... 
> {noformat} > As with the first example, this succeeds on 3.3.3: > {noformat} > scala> val df = Seq(Map(0 -> Some(Seq(0.toDF("a") > df: org.apache.spark.sql.DataFrame = [a: map>] > scala> df.collect > res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))]) > {noformat} > Other cases that fail on 3.4.1, 3.5.0, and master but work fine on 3.3.3: > - {{Seq[Option[Timestamp]]}} > - {{Map[Option[Timestamp]]}} > - {{Seq[Option[Date]]}} > - {{Map[Option[Date]]}} > - {{Seq[Option[BigDecimal]]}} > - {{Map[Option[BigDecimal]]}} > However, the following work fine on 3.3.3, 3.4.1, 3.5.0, and master: > - {{Seq[Option[Map]]}} > - {{Map[Option[Map]]}} > - {{Seq[Option[]]}} > - {{Map[Option[]]}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]
[ https://issues.apache.org/jira/browse/SPARK-45896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-45896: -- Description: The following action fails on 3.4.1, 3.5.0, and master: {noformat} scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a") val df = Seq(Seq(Some(Seq(0)))).toDF("a") org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to encode a value of the expressions: mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), unwrapoption(ObjectType(interface scala.collection.immutable.Seq), validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) AS value#0 to a row. SQLSTATE: 42846 ... Caused by: java.lang.RuntimeException: scala.Some is not a valid external type for schema of array at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown Source) ... 
{noformat} However, it succeeds on 3.3.3: {noformat} scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a") df: org.apache.spark.sql.DataFrame = [a: array<array<int>>] scala> df.collect res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))]) {noformat} Map of Option[Seq] also fails on 3.4.1, 3.5.0, and master: {noformat} scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a") val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a") org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to encode a value of the expressions: externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, ObjectType(class java.lang.Object), false, -1), assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), true, -2), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -3), assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -3), IntegerType, IntegerType)), unwrapoption(ObjectType(interface scala.collection.immutable.Seq), validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), ObjectType(class scala.Option))), None), input[0, scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846 ... Caused by: java.lang.RuntimeException: scala.Some is not a valid external type for schema of array at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown Source) ... 
{noformat} As with the first example, this succeeds on 3.3.3: {noformat} scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a") df: org.apache.spark.sql.DataFrame = [a: map<int,array<int>>] scala> df.collect res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))]) {noformat} Other cases that fail on 3.4.1, 3.5.0, and master but work fine on 3.3.3: - {{Seq[Option[Timestamp]]}} - {{Map[Option[Timestamp]]}} - {{Seq[Option[Date]]}} - {{Map[Option[Date]]}} - {{Seq[Option[BigDecimal]]}} - {{Map[Option[BigDecimal]]}} However, the following work fine on 3.3.3, 3.4.1, 3.5.0, and master: - {{Seq[Option[Map]]}} - {{Map[Option[Map]]}} - {{Seq[Option[]]}} - {{Map[Option[]]}} was: The following action fails on 3.4.1, 3.5.0, and master: {noformat} scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a") val df = Seq(Seq(Some(Seq(0)))).toDF("a") org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to encode a value of the expressions: mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), unwrapoption(ObjectType(interface scala.collection.immutable.Seq), validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) AS value#0 to a row. SQLSTATE: 42846 ... Caused by: java.lang.RuntimeException: scala.Some is not a valid external type for schema of array at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown Source) ... {noformat} However, it succeeds on 3.3.3: {noformat} scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a") df:
[jira] [Updated] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal]
[ https://issues.apache.org/jira/browse/SPARK-45896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-45896: -- Summary: Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal] (was: Expression encoding fails for Seq/Map of Option[Seq]) > Expression encoding fails for Seq/Map of Option[Seq/Date/Timestamp/BigDecimal] > -- > > Key: SPARK-45896 > URL: https://issues.apache.org/jira/browse/SPARK-45896 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > The following action fails on 3.4.1, 3.5.0, and master: > {noformat} > scala> val df = Seq(Seq(Some(Seq(0.toDF("a") > val df = Seq(Seq(Some(Seq(0.toDF("a") > org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed > to encode a value of the expressions: mapobjects(lambdavariable(MapObject, > ObjectType(class java.lang.Object), true, -1), > mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), > true, -2), assertnotnull(validateexternaltype(lambdavariable(MapObject, > ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), > unwrapoption(ObjectType(interface scala.collection.immutable.Seq), > validateexternaltype(lambdavariable(MapObject, ObjectType(class > java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class > scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) > AS value#0 to a row. SQLSTATE: 42846 > ... > Caused by: java.lang.RuntimeException: scala.Some is not a valid external > type for schema of array > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown > Source) > ... 
> {noformat} > However, it succeeds on 3.3.3: > {noformat} > scala> val df = Seq(Seq(Some(Seq(0.toDF("a") > df: org.apache.spark.sql.DataFrame = [a: array>] > scala> df.collect > res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))]) > {noformat} > Map of Option[Seq] also fails on 3.4.1, 3.5.0, and master: > {noformat} > scala> val df = Seq(Map(0 -> Some(Seq(0.toDF("a") > val df = Seq(Map(0 -> Some(Seq(0.toDF("a") > org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed > to encode a value of the expressions: > externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, > ObjectType(class java.lang.Object), false, -1), > assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, > ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), > lambdavariable(ExternalMapToCatalyst_value, ObjectType(class > java.lang.Object), true, -2), mapobjects(lambdavariable(MapObject, > ObjectType(class java.lang.Object), true, -3), > assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class > java.lang.Object), true, -3), IntegerType, IntegerType)), > unwrapoption(ObjectType(interface scala.collection.immutable.Seq), > validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, > ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), > ObjectType(class scala.Option))), None), input[0, > scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846 > ... > Caused by: java.lang.RuntimeException: scala.Some is not a valid external > type for schema of array > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown > Source) > ... 
> {noformat} > As with the first example, this succeeds on 3.3.3: > {noformat} > scala> val df = Seq(Map(0 -> Some(Seq(0.toDF("a") > df: org.apache.spark.sql.DataFrame = [a: map>] > scala> df.collect > res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))]) > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
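[editor's note] The "scala.Some is not a valid external type for schema of array" error in SPARK-45896 arises because the serializer's `validateexternaltype` step checks the external value against the `ArrayType` slot while the value is still wrapped in `Some(...)`, i.e. before `unwrapoption` runs. A toy Python model of that ordering problem (hypothetical names, no Spark required):

```python
# Toy model of the SPARK-45896 failure: an ArrayType encoder validates the
# external value's runtime type, but for Seq[Option[Seq[_]]] the value
# arrives still wrapped in Some(...), so validation rejects it.
class Some:
    """Stand-in for scala.Some."""
    def __init__(self, value):
        self.value = value

def validate_external_type(value, schema="array"):
    # Mirrors validateexternaltype: a wrapper is not a valid external type.
    if isinstance(value, Some):
        raise RuntimeError(
            f"scala.Some is not a valid external type for schema of {schema}")
    return value

def encode_array(value):
    return list(validate_external_type(value))

assert encode_array([0]) == [0]      # a bare Seq encodes fine

# Conceptually, the fix unwraps the Option before validating, so
# Some(Seq(0)) round-trips like it did on 3.3.x:
def encode_optional_array(value):
    if isinstance(value, Some):
        value = value.value          # unwrapoption before validation
    return encode_array(value)

assert encode_optional_array(Some([0])) == [0]
```

This is an illustration of the evaluation-order issue only; the actual repair lives in Spark's `ExpressionEncoder` serializer construction, patched under SPARK-45896.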
[jira] [Updated] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq]
[ https://issues.apache.org/jira/browse/SPARK-45896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-45896: -- Description: The following action fails on 3.4.1, 3.5.0, and master: {noformat} scala> val df = Seq(Seq(Some(Seq(0.toDF("a") val df = Seq(Seq(Some(Seq(0.toDF("a") org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to encode a value of the expressions: mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), unwrapoption(ObjectType(interface scala.collection.immutable.Seq), validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) AS value#0 to a row. SQLSTATE: 42846 ... Caused by: java.lang.RuntimeException: scala.Some is not a valid external type for schema of array at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown Source) ... 
{noformat} However, it succeeds on 3.3.3: {noformat} scala> val df = Seq(Seq(Some(Seq(0.toDF("a") df: org.apache.spark.sql.DataFrame = [a: array>] scala> df.collect res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))]) {noformat} Map of Option[Seq] also fails on 3.4.1, 3.5.0, and master: {noformat} scala> val df = Seq(Map(0 -> Some(Seq(0.toDF("a") val df = Seq(Map(0 -> Some(Seq(0.toDF("a") org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to encode a value of the expressions: externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, ObjectType(class java.lang.Object), false, -1), assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), true, -2), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -3), assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -3), IntegerType, IntegerType)), unwrapoption(ObjectType(interface scala.collection.immutable.Seq), validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), ObjectType(class scala.Option))), None), input[0, scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846 ... Caused by: java.lang.RuntimeException: scala.Some is not a valid external type for schema of array at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown Source) ... 
{noformat} As with the first example, this succeeds on 3.3.3: {noformat} scala> val df = Seq(Map(0 -> Some(Seq(0.toDF("a") df: org.apache.spark.sql.DataFrame = [a: map>] scala> df.collect res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))]) {noformat} was: The following action fails on 3.4.1, 3.5.0, and master: {noformat} scala> val df = Seq(Seq(Some(Seq(0.toDF("a") val df = Seq(Seq(Some(Seq(0.toDF("a") org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to encode a value of the expressions: mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), unwrapoption(ObjectType(interface scala.collection.immutable.Seq), validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) AS value#0 to a row. SQLSTATE: 42846 ... Caused by: java.lang.RuntimeException: scala.Some is not a valid external type for schema of array at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown Source) ... {noformat} However, it succeeds on 3.3.3: {noformat} scala> val df = Seq(Seq(Some(Seq(0.toDF("a") df: org.apache.spark.sql.DataFrame = [a: array>] scala> df.collect res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))]) {noformat} Map of option of sequence also fails on 3.4.1, 3.5.0, and master: {noformat} scala> val df = Seq(Map(0 -> Some(Seq(0.toDF("a") val df = Seq(Map(0 -> Some(Seq(0.toDF("a") org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to
[jira] [Created] (SPARK-45896) Expression encoding fails for Seq/Map of Option[Seq]
Bruce Robbins created SPARK-45896: - Summary: Expression encoding fails for Seq/Map of Option[Seq] Key: SPARK-45896 URL: https://issues.apache.org/jira/browse/SPARK-45896 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0, 3.4.1 Reporter: Bruce Robbins The following action fails on 3.4.1, 3.5.0, and master: {noformat} scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a") val df = Seq(Seq(Some(Seq(0)))).toDF("a") org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to encode a value of the expressions: mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -2), IntegerType, IntegerType)), unwrapoption(ObjectType(interface scala.collection.immutable.Seq), validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -1), ArrayType(IntegerType,false), ObjectType(class scala.Option))), None), input[0, scala.collection.immutable.Seq, true], None) AS value#0 to a row. SQLSTATE: 42846 ... Caused by: java.lang.RuntimeException: scala.Some is not a valid external type for schema of array<int> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown Source) ...
{noformat} However, it succeeds on 3.3.3: {noformat} scala> val df = Seq(Seq(Some(Seq(0)))).toDF("a") df: org.apache.spark.sql.DataFrame = [a: array<array<int>>] scala> df.collect res0: Array[org.apache.spark.sql.Row] = Array([WrappedArray(WrappedArray(0))]) {noformat} Map of option of sequence also fails on 3.4.1, 3.5.0, and master: {noformat} scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a") val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a") org.apache.spark.SparkRuntimeException: [EXPRESSION_ENCODING_FAILED] Failed to encode a value of the expressions: externalmaptocatalyst(lambdavariable(ExternalMapToCatalyst_key, ObjectType(class java.lang.Object), false, -1), assertnotnull(validateexternaltype(lambdavariable(ExternalMapToCatalyst_key, ObjectType(class java.lang.Object), false, -1), IntegerType, IntegerType)), lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), true, -2), mapobjects(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -3), assertnotnull(validateexternaltype(lambdavariable(MapObject, ObjectType(class java.lang.Object), true, -3), IntegerType, IntegerType)), unwrapoption(ObjectType(interface scala.collection.immutable.Seq), validateexternaltype(lambdavariable(ExternalMapToCatalyst_value, ObjectType(class java.lang.Object), true, -2), ArrayType(IntegerType,false), ObjectType(class scala.Option))), None), input[0, scala.collection.immutable.Map, true]) AS value#0 to a row. SQLSTATE: 42846 ... Caused by: java.lang.RuntimeException: scala.Some is not a valid external type for schema of array<int> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_0$(Unknown Source) ...
{noformat} As with the first example, this succeeds on 3.3.3: {noformat} scala> val df = Seq(Map(0 -> Some(Seq(0)))).toDF("a") df: org.apache.spark.sql.DataFrame = [a: map<int,array<int>>] scala> df.collect res0: Array[org.apache.spark.sql.Row] = Array([Map(0 -> WrappedArray(0))]) {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
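The behavior that Spark 3.3.3 exhibits above (an Option inside a nested collection is transparently unwrapped before encoding) can be modeled with a short plain-Python sketch. This is illustrative only, not Spark code; the tuple encoding `("Some", x)` used to stand in for `scala.Option` is an assumption for the example.

```python
# Minimal model of the Option-unwrapping the encoder performs on 3.3.3:
# Some(x) contributes x, and collections are unwrapped recursively.
# ("Some", x) is a hypothetical stand-in for scala.Some(x).
def unwrap(value):
    if isinstance(value, tuple) and len(value) == 2 and value[0] == "Some":
        return unwrap(value[1])  # unwrap the Option payload
    if isinstance(value, list):
        return [unwrap(v) for v in value]
    if isinstance(value, dict):
        return {k: unwrap(v) for k, v in value.items()}
    return value

# Seq(Seq(Some(Seq(0)))) -> one row holding [[0]], matching the
# Array([WrappedArray(WrappedArray(0))]) result shown for 3.3.3.
print(unwrap([[("Some", [0])]]))   # [[[0]]]
# Map(0 -> Some(Seq(0))) -> {0: [0]}, matching Map(0 -> WrappedArray(0)).
print(unwrap({0: ("Some", [0])}))  # {0: [0]}
```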
[jira] [Commented] (SPARK-45797) Discrepancies in PySpark DataFrame Results When Using Window Functions and Filters
[ https://issues.apache.org/jira/browse/SPARK-45797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783015#comment-17783015 ] Bruce Robbins commented on SPARK-45797: --- I wonder if this is the same as SPARK-45543, which had two window specs and then produced wrong answers when filtered on rank = 1. > Discrepancies in PySpark DataFrame Results When Using Window Functions and > Filters > -- > > Key: SPARK-45797 > URL: https://issues.apache.org/jira/browse/SPARK-45797 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.5.0 > Environment: Python 3.10 > Pyspark 3.5.0 > Ubuntu 22.04.3 LTS >Reporter: Daniel Diego Horcajuelo >Priority: Major > Fix For: 3.5.0 > > > When doing certain types of transformations on a dataframe which involve > window functions with filters I am getting the wrong results. Here is a > minimal example of the results I get with my code: > > {code:java} > from pyspark.sql import SparkSession > import pyspark.sql.functions as f > from pyspark.sql.window import Window as w > from datetime import datetime, date > spark = SparkSession.builder.config("spark.sql.repl.eagerEval.enabled", > True).getOrCreate() > # Base dataframe > df = spark.createDataFrame( > [ > (1, date(2023, 10, 1), date(2023, 10, 2), "open"), > (1, date(2023, 10, 2), date(2023, 10, 3), "close"), > (2, date(2023, 10, 1), date(2023, 10, 2), "close"), > (2, date(2023, 10, 2), date(2023, 10, 4), "close"), > (3, date(2023, 10, 2), date(2023, 10, 4), "open"), > (3, date(2023, 10, 3), date(2023, 10, 6), "open"), > ], > schema="id integer, date_start date, date_end date, status string" > ) > # We define two partition functions > partition = w.partitionBy("id").orderBy("date_start", > "date_end").rowsBetween(w.unboundedPreceding, w.unboundedFollowing) > partition2 = w.partitionBy("id").orderBy("date_start", "date_end") > # Define dataframe A > A = df.withColumn( > "date_end_of_last_close", > f.max(f.when(f.col("status") == "close", > 
f.col("date_end"))).over(partition) > ).withColumn( > "rank", > f.row_number().over(partition2) > ) > display(A) > | id | date_start | date_end | status | date_end_of_last_close | rank | > |----|------------|------------|--------|------------------------|------| > | 1 | 2023-10-01 | 2023-10-02 | open | 2023-10-03 | 1| > | 1 | 2023-10-02 | 2023-10-03 | close | 2023-10-03 | 2| > | 2 | 2023-10-01 | 2023-10-02 | close | 2023-10-04 | 1| > | 2 | 2023-10-02 | 2023-10-04 | close | 2023-10-04 | 2| > | 3 | 2023-10-02 | 2023-10-04 | open | NULL | 1| > | 3 | 2023-10-03 | 2023-10-06 | open | NULL | 2| > # When filtering by rank = 1, I get this weird result > A_result = A.filter(f.col("rank") == 1).drop("rank") > display(A_result) > | id | date_start | date_end | status | date_end_of_last_close | > |----|------------|------------|--------|------------------------| > | 1 | 2023-10-01 | 2023-10-02 | open | NULL | > | 2 | 2023-10-01 | 2023-10-02 | close | 2023-10-02 | > | 3 | 2023-10-02 | 2023-10-04 | open | NULL | {code} > I think the Spark engine might be managing the internal partitions incorrectly. If > creating the dataframe from scratch (without transformations), the filtering > operation returns the right result. In PySpark 3.4.0 this error doesn't > happen. > > For more details, please check out the same question on Stack Overflow: > [stackoverflow > question|https://stackoverflow.com/questions/77396807/discrepancies-in-pyspark-dataframe-results-when-using-window-functions-and-filte?noredirect=1#comment136446225_77396807] > > I'll mark this issue as important because it affects some basic operations > that are used daily -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
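The result the reporter expects after filtering on rank = 1 can be recomputed from the rows above in plain Python (a sketch of the intended window semantics, not PySpark code): per `id` partition, `date_end_of_last_close` is the max `date_end` over the whole partition where status is "close", and the filter keeps the first row per partition.

```python
from datetime import date
from itertools import groupby

rows = [
    (1, date(2023, 10, 1), date(2023, 10, 2), "open"),
    (1, date(2023, 10, 2), date(2023, 10, 3), "close"),
    (2, date(2023, 10, 1), date(2023, 10, 2), "close"),
    (2, date(2023, 10, 2), date(2023, 10, 4), "close"),
    (3, date(2023, 10, 2), date(2023, 10, 4), "open"),
    (3, date(2023, 10, 3), date(2023, 10, 6), "open"),
]

expected = []
for _, grp in groupby(sorted(rows, key=lambda r: (r[0], r[1], r[2])),
                      key=lambda r: r[0]):
    part = list(grp)
    # max(date_end) over the unbounded partition, restricted to "close" rows
    closes = [r[2] for r in part if r[3] == "close"]
    last_close = max(closes) if closes else None
    # row_number() == 1 after ordering by (date_start, date_end)
    first = part[0]
    expected.append((first[0], first[1], first[2], first[3], last_close))

for row in expected:
    print(row)
```

This yields `2023-10-03` for id 1 and `2023-10-04` for id 2, i.e. the values from the first displayed table, not the NULL/`2023-10-02` values the bug produces after filtering.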
[jira] [Commented] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "scala.Some is not a valid external type for schema of array"
[ https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781531#comment-17781531 ] Bruce Robbins commented on SPARK-45644: --- I will look into it and try to submit a fix. If I can't, I will ping someone who can. > After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException > "scala.Some is not a valid external type for schema of array" > -- > > Key: SPARK-45644 > URL: https://issues.apache.org/jira/browse/SPARK-45644 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.4.1, 3.5.0 >Reporter: Adi Wehrli >Priority: Major > > I do not really know if this is a bug, but I am at the end with my knowledge. > A Spark job ran successfully with Spark 3.2.x and 3.3.x. > But after upgrading to 3.4.1 (as well as with 3.5.0) running the same job > with the same data the following always occurs now: > {code} > scala.Some is not a valid external type for schema of array > {code} > The corresponding stacktrace is: > {code} > 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor > msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch > worker for task 0.0 in stage 0.0 (TID 0)" > java.lang.RuntimeException: scala.Some is not a valid external type for > schema of array > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) ~[?:?] 
> at > org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165) > ~[spark-sql_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380) > ~[spark-sql_2.12-3.5.0.jar:3.5.0] > at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) > ~[scala-library-2.12.15.jar:?] > at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) > ~[scala-library-2.12.15.jar:?] > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at org.apache.spark.scheduler.Task.run(Task.scala:141) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) > ~[spark-common-utils_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) > ~[spark-common-utils_2.12-3.5.0.jar:3.5.0] > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) > [spark-core_2.12-3.5.0.jar:3.5.0] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > [?:?] 
> at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > [?:?] > at java.lang.Thread.run(Thread.java:834) [?:?] > 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor > msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch > worker for task 1.0 in stage 0.0 (TID 1)" > java.lang.RuntimeException: scala.Some is not a valid external type for > schema of array > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown > Source) ~[?:?] > at >
[jira] [Commented] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "scala.Some is not a valid external type for schema of array"
[ https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781494#comment-17781494 ] Bruce Robbins commented on SPARK-45644: --- OK, I can reproduce. I will take a look. I will also try to get my reproduction example down to a minimal case and will post here later. > After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException > "scala.Some is not a valid external type for schema of array" > -- > > Key: SPARK-45644 > URL: https://issues.apache.org/jira/browse/SPARK-45644 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.4.1, 3.5.0 >Reporter: Adi Wehrli >Priority: Major > > I do not really know if this is a bug, but I am at the end with my knowledge. > A Spark job ran successfully with Spark 3.2.x and 3.3.x. > But after upgrading to 3.4.1 (as well as with 3.5.0) running the same job > with the same data the following always occurs now: > {code} > scala.Some is not a valid external type for schema of array > {code} > The corresponding stacktrace is: > {code} > 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor > msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch > worker for task 0.0 in stage 0.0 (TID 0)" > java.lang.RuntimeException: scala.Some is not a valid external type for > schema of array > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown > Source) ~[?:?] 
> at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165) > ~[spark-sql_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380) > ~[spark-sql_2.12-3.5.0.jar:3.5.0] > at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) > ~[scala-library-2.12.15.jar:?] > at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) > ~[scala-library-2.12.15.jar:?] > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at org.apache.spark.scheduler.Task.run(Task.scala:141) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) > ~[spark-common-utils_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) > ~[spark-common-utils_2.12-3.5.0.jar:3.5.0] > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) > [spark-core_2.12-3.5.0.jar:3.5.0] > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > [?:?] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > [?:?] > at java.lang.Thread.run(Thread.java:834) [?:?] > 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor > msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch > worker for task 1.0 in stage 0.0 (TID 1)" > java.lang.RuntimeException: scala.Some is not a valid external type for > schema of array > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown > Source) ~[?:?] > at >
[jira] [Commented] (SPARK-45644) After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException "scala.Some is not a valid external type for schema of array"
[ https://issues.apache.org/jira/browse/SPARK-45644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781091#comment-17781091 ] Bruce Robbins commented on SPARK-45644: --- You can turn on display of the generated code by adding the following to your log4j conf: {noformat} logger.codegen.name = org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator logger.codegen.level = debug {noformat} Do you have any application code you can share? It looks like the error happens at the start of the job (task 0 stage 0). > After upgrading to Spark 3.4.1 and 3.5.0 we receive RuntimeException > "scala.Some is not a valid external type for schema of array" > -- > > Key: SPARK-45644 > URL: https://issues.apache.org/jira/browse/SPARK-45644 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.4.1, 3.5.0 >Reporter: Adi Wehrli >Priority: Major > > I do not really know if this is a bug, but I am at the end with my knowledge. > A Spark job ran successfully with Spark 3.2.x and 3.3.x. > But after upgrading to 3.4.1 (as well as with 3.5.0) running the same job > with the same data the following always occurs now: > {code} > scala.Some is not a valid external type for schema of array > {code} > The corresponding stacktrace is: > {code} > 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor > msg="Exception in task 0.0 in stage 0.0 (TID 0)" thread="Executor task launch > worker for task 0.0 in stage 0.0 (TID 0)" > java.lang.RuntimeException: scala.Some is not a valid external type for > schema of array > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.MapObjects_10$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.ExternalMapToCatalyst_1$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.createNamedStruct_14_3$(Unknown > Source) ~[?:?] 
> at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.If_12$(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) ~[?:?] > at > org.apache.spark.sql.execution.ObjectOperator$.$anonfun$serializeObjectToRow$1(objects.scala:165) > ~[spark-sql_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.sql.execution.AppendColumnsWithObjectExec.$anonfun$doExecute$15(objects.scala:380) > ~[spark-sql_2.12-3.5.0.jar:3.5.0] > at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) > ~[scala-library-2.12.15.jar:?] > at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) > ~[scala-library-2.12.15.jar:?] > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:169) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at org.apache.spark.scheduler.Task.run(Task.scala:141) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) > ~[spark-common-utils_2.12-3.5.0.jar:3.5.0] > at > org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) > ~[spark-common-utils_2.12-3.5.0.jar:3.5.0] > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94) > ~[spark-core_2.12-3.5.0.jar:3.5.0] > at > 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623) > [spark-core_2.12-3.5.0.jar:3.5.0] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > [?:?] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > [?:?] > at java.lang.Thread.run(Thread.java:834) [?:?] > 2023-10-24T06:28:50.932 level=ERROR logger=org.apache.spark.executor.Executor > msg="Exception in task 1.0 in stage 0.0 (TID 1)" thread="Executor task launch > worker for task 1.0 in stage 0.0 (TID 1)" > java.lang.RuntimeException:
[jira] [Updated] (SPARK-45580) Subquery changes the output schema of outer query
[ https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-45580: -- Summary: Subquery changes the output schema of outer query (was: RewritePredicateSubquery unexpectedly changes the output schema of certain queries) > Subquery changes the output schema of outer query > - > > Key: SPARK-45580 > URL: https://issues.apache.org/jira/browse/SPARK-45580 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.3, 3.4.1, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > A query can have an incorrect output schema because of a subquery. > Assume this data: > {noformat} > create or replace temp view t1(a) as values (1), (2), (3), (7); > create or replace temp view t2(c1) as values (1), (2), (3); > create or replace temp view t3(col1) as values (3), (9); > cache table t1; > cache table t2; > cache table t3; > {noformat} > When run in {{spark-sql}}, the following query has a superfluous boolean > column: > {noformat} > select * > from t1 > where exists ( > select c1 > from t2 > where a = c1 > or a in (select col1 from t3) > ); > 1 false > 2 false > 3 true > {noformat} > The result should be: > {noformat} > 1 > 2 > 3 > {noformat} > When executed via the {{Dataset}} API, you don't see the incorrect result, > because the Dataset API truncates the right-side of the rows based on the > analyzed plan's schema (it's the optimized plan's schema that goes wrong). 
> However, even with the {{Dataset}} API, this query goes wrong: > {noformat} > select ( > select * > from t1 > where exists ( > select c1 > from t2 > where a = c1 > or a in (select col1 from t3) > ) > limit 1 > ) > from range(1); > java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; > something went wrong in analysis > at scala.Predef$.assert(Predef.scala:279) > at > org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275) > at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) > at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) > at scala.collection.AbstractIterable.foreach(Iterable.scala:933) > ... > {noformat} > Other queries that have the wrong schema: > {noformat} > select * > from t1 > where a in ( > select c1 > from t2 > where a in (select col1 from t3) > ); > {noformat} > and > {noformat} > select * > from t1 > where not exists ( > select c1 > from t2 > where a = c1 > or a in (select col1 from t3) > ); > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45580) Subquery changes the output schema of the outer query
[ https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-45580: -- Summary: Subquery changes the output schema of the outer query (was: Subquery changes the output schema of outer query) > Subquery changes the output schema of the outer query > - > > Key: SPARK-45580 > URL: https://issues.apache.org/jira/browse/SPARK-45580 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.3, 3.4.1, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > A query can have an incorrect output schema because of a subquery. > Assume this data: > {noformat} > create or replace temp view t1(a) as values (1), (2), (3), (7); > create or replace temp view t2(c1) as values (1), (2), (3); > create or replace temp view t3(col1) as values (3), (9); > cache table t1; > cache table t2; > cache table t3; > {noformat} > When run in {{spark-sql}}, the following query has a superfluous boolean > column: > {noformat} > select * > from t1 > where exists ( > select c1 > from t2 > where a = c1 > or a in (select col1 from t3) > ); > 1 false > 2 false > 3 true > {noformat} > The result should be: > {noformat} > 1 > 2 > 3 > {noformat} > When executed via the {{Dataset}} API, you don't see the incorrect result, > because the Dataset API truncates the right-side of the rows based on the > analyzed plan's schema (it's the optimized plan's schema that goes wrong). 
> However, even with the {{Dataset}} API, this query goes wrong: > {noformat} > select ( > select * > from t1 > where exists ( > select c1 > from t2 > where a = c1 > or a in (select col1 from t3) > ) > limit 1 > ) > from range(1); > java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; > something went wrong in analysis > at scala.Predef$.assert(Predef.scala:279) > at > org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275) > at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) > at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) > at scala.collection.AbstractIterable.foreach(Iterable.scala:933) > ... > {noformat} > Other queries that have the wrong schema: > {noformat} > select * > from t1 > where a in ( > select c1 > from t2 > where a in (select col1 from t3) > ); > {noformat} > and > {noformat} > select * > from t1 > where not exists ( > select c1 > from t2 > where a = c1 > or a in (select col1 from t3) > ); > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
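The intended two-column-free result of the EXISTS query above can be checked with a plain-Python sketch of the predicate semantics (illustrative only, not Spark code):

```python
# Data from the report: t1(a), t2(c1), t3(col1).
t1 = [1, 2, 3, 7]
t2 = [1, 2, 3]
t3 = [3, 9]

# where exists (select c1 from t2 where a = c1 or a in (select col1 from t3))
result = [a for a in t1 if any(a == c1 or a in t3 for c1 in t2)]
print(result)  # [1, 2, 3]
```

Each output row carries only `a`, matching the expected result `1 2 3` with no superfluous boolean column.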
[jira] [Resolved] (SPARK-45583) Spark SQL returning incorrect values for full outer join on keys with the same name.
[ https://issues.apache.org/jira/browse/SPARK-45583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins resolved SPARK-45583. --- Resolution: Fixed > Spark SQL returning incorrect values for full outer join on keys with the > same name. > > > Key: SPARK-45583 > URL: https://issues.apache.org/jira/browse/SPARK-45583 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Huw >Priority: Major > Fix For: 3.5.0 > > > {{The following query gives the wrong results.}} > > {{WITH people as (}} > {{ SELECT * FROM (VALUES }} > {{ (1, 'Peter'), }} > {{ (2, 'Homer'), }} > {{ (3, 'Ned'),}} > {{ (3, 'Jenny')}} > {{ ) AS Idiots(id, FirstName)}} > {{)}} > {{, location as (}} > {{ SELECT * FROM (VALUES}} > {{ (1, 'sample0'),}} > {{ (1, 'sample1'),}} > {{ (2, 'sample2') }} > {{ ) as Locations(id, address)}} > {{)}} > {{SELECT}} > {{ *}} > {{FROM}} > {{ people}} > {{FULL OUTER JOIN}} > {{ location}} > {{ON}} > {{ people.id = location.id}} > {{We find the following table:}} > ||id: integer||FirstName: string||id: integer||address: string|| > |2|Homer|2|sample2| > |null|Ned|null|null| > |null|Jenny|null|null| > |1|Peter|1|sample0| > |1|Peter|1|sample1| > {{But clearly the first `id` column is wrong, the nulls should be 3.}} > If we rename the id column in (only) the person table to pid we get the > correct results: > ||pid: integer||FirstName: string||id: integer||address: string|| > |2|Homer|2|sample2| > |3|Ned|null|null| > |3|Jenny|null|null| > |1|Peter|1|sample0| > |1|Peter|1|sample1| -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
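The correct table quoted in the report (Ned and Jenny keeping id 3 on the left, NULLs only on the right) follows from ordinary full-outer-join semantics, which can be sketched in plain Python (illustrative only, not Spark code):

```python
people = [(1, "Peter"), (2, "Homer"), (3, "Ned"), (3, "Jenny")]
location = [(1, "sample0"), (1, "sample1"), (2, "sample2")]

rows = []
matched_right = set()
for pid, name in people:
    hits = [(lid, addr) for lid, addr in location if lid == pid]
    if hits:
        for lid, addr in hits:
            rows.append((pid, name, lid, addr))
            matched_right.add((lid, addr))
    else:
        # An unmatched left row keeps its own id; only the right side is NULL.
        rows.append((pid, name, None, None))
for lid, addr in location:
    if (lid, addr) not in matched_right:
        rows.append((None, None, lid, addr))

for row in rows:
    print(row)
```

The Ned and Jenny rows come out as `(3, 'Ned', None, None)` and `(3, 'Jenny', None, None)`: the left-side id must be 3, which is what the bug report says Spark should have returned.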
[jira] [Commented] (SPARK-45601) stackoverflow when executing rule ExtractWindowExpressions
[ https://issues.apache.org/jira/browse/SPARK-45601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1304#comment-1304 ] Bruce Robbins commented on SPARK-45601: --- Possibly SPARK-38666 > stackoverflow when executing rule ExtractWindowExpressions > -- > > Key: SPARK-45601 > URL: https://issues.apache.org/jira/browse/SPARK-45601 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.3 >Reporter: JacobZheng >Priority: Major > > I am encountering stack overflow errors while executing the following test > case. Looking at the source code, the rule ExtractWindowExpressions does not extract the > window correctly and gets stuck in an infinite loop in > resolveOperatorsDownWithPruning. > {code:scala} > test("agg filter contains window") { > val src = Seq((1, "b", "c")).toDF("col1", "col2", "col3") > .withColumn("test", > expr("count(col1) filter (where min(col1) over(partition by col2 > order by col3)>1)")) > src.show() > } > {code} > My question is: is a window function inside an agg filter valid usage? > Or should I add a check, like the one in Spark SQL, that throws the error "It is not allowed > to use window functions inside WHERE clause"? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45583) Spark SQL returning incorrect values for full outer join on keys with the same name.
[ https://issues.apache.org/jira/browse/SPARK-45583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17776783#comment-17776783 ] Bruce Robbins commented on SPARK-45583: --- Strangely, I cannot reproduce. Is some setting required? {noformat} sql("select version()").show(false) +----------------------------------------------+ |version() | +----------------------------------------------+ |3.5.0 ce5ddad990373636e94071e7cef2f31021add07b| +----------------------------------------------+ scala> sql("""WITH people as ( SELECT * FROM (VALUES (1, 'Peter'), (2, 'Homer'), (3, 'Ned'), (3, 'Jenny') ) AS Idiots(id, FirstName) ), location as ( SELECT * FROM (VALUES (1, 'sample0'), (1, 'sample1'), (2, 'sample2') ) as Locations(id, address) )SELECT * FROM people FULL OUTER JOIN location ON people.id = location.id""").show(false) +---+---------+----+-------+ |id |FirstName|id |address| +---+---------+----+-------+ |1 |Peter|1 |sample0| |1 |Peter|1 |sample1| |2 |Homer|2 |sample2| |3 |Ned |NULL|NULL | |3 |Jenny|NULL|NULL | +---+---------+----+-------+ scala> {noformat} > Spark SQL returning incorrect values for full outer join on keys with the > same name.
> > > Key: SPARK-45583 > URL: https://issues.apache.org/jira/browse/SPARK-45583 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Huw >Priority: Major > > {{The following query gives the wrong results.}} > > {{WITH people as (}} > {{ SELECT * FROM (VALUES }} > {{ (1, 'Peter'), }} > {{ (2, 'Homer'), }} > {{ (3, 'Ned'),}} > {{ (3, 'Jenny')}} > {{ ) AS Idiots(id, FirstName)}} > {{)}} > {{, location as (}} > {{ SELECT * FROM (VALUES}} > {{ (1, 'sample0'),}} > {{ (1, 'sample1'),}} > {{ (2, 'sample2') }} > {{ ) as Locations(id, address)}} > {{)}} > {{SELECT}} > {{ *}} > {{FROM}} > {{ people}} > {{FULL OUTER JOIN}} > {{ location}} > {{ON}} > {{ people.id = location.id}} > {{We find the following table:}} > ||id: integer||FirstName: string||id: integer||address: string|| > |2|Homer|2|sample2| > |null|Ned|null|null| > |null|Jenny|null|null| > |1|Peter|1|sample0| > |1|Peter|1|sample1| > {{But clearly the first `id` column is wrong, the nulls should be 3.}} > If we rename the id column in (only) the person table to pid we get the > correct results: > ||pid: integer||FirstName: string||id: integer||address: string|| > |2|Homer|2|sample2| > |3|Ned|null|null| > |3|Jenny|null|null| > |1|Peter|1|sample0| > |1|Peter|1|sample1| -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45580) RewritePredicateSubquery unexpectedly changes the output schema of certain queries
[ https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17776401#comment-17776401 ] Bruce Robbins commented on SPARK-45580: --- I'll make a PR in the coming days. > RewritePredicateSubquery unexpectedly changes the output schema of certain > queries > -- > > Key: SPARK-45580 > URL: https://issues.apache.org/jira/browse/SPARK-45580 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.3, 3.4.1, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > A query can have an incorrect output schema because of a subquery. > Assume this data: > {noformat} > create or replace temp view t1(a) as values (1), (2), (3), (7); > create or replace temp view t2(c1) as values (1), (2), (3); > create or replace temp view t3(col1) as values (3), (9); > cache table t1; > cache table t2; > cache table t3; > {noformat} > When run in {{spark-sql}}, the following query has a superfluous boolean > column: > {noformat} > select * > from t1 > where exists ( > select c1 > from t2 > where a = c1 > or a in (select col1 from t3) > ); > 1 false > 2 false > 3 true > {noformat} > The result should be: > {noformat} > 1 > 2 > 3 > {noformat} > When executed via the {{Dataset}} API, you don't see the incorrect result, > because the Dataset API truncates the right-side of the rows based on the > analyzed plan's schema (it's the optimized plan's schema that goes wrong). 
> However, even with the {{Dataset}} API, this query goes wrong: > {noformat} > select ( > select * > from t1 > where exists ( > select c1 > from t2 > where a = c1 > or a in (select col1 from t3) > ) > limit 1 > ) > from range(1); > java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; > something went wrong in analysis > at scala.Predef$.assert(Predef.scala:279) > at > org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275) > at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) > at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) > at scala.collection.AbstractIterable.foreach(Iterable.scala:933) > ... > {noformat} > Other queries that have the wrong schema: > {noformat} > select * > from t1 > where a in ( > select c1 > from t2 > where a in (select col1 from t3) > ); > {noformat} > and > {noformat} > select * > from t1 > where not exists ( > select c1 > from t2 > where a = c1 > or a in (select col1 from t3) > ); > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
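For reference, the semantics the EXISTS predicate should have can be sketched in plain Python (data from the report; the helper name is mine). The point is that the query's output is a single column, with no extra boolean:

```python
# Data from the report: t1(a), t2(c1), t3(col1).
t1 = [1, 2, 3, 7]
t2 = [1, 2, 3]
t3 = [3, 9]

def exists_pred(a):
    # EXISTS (SELECT c1 FROM t2 WHERE a = c1 OR a IN (SELECT col1 FROM t3))
    return any(a == c1 or a in t3 for c1 in t2)

# The correct result is just the filtered column of t1: [1, 2, 3].
result = [a for a in t1 if exists_pred(a)]
```

The bug described above is that the optimized plan leaks the rewritten predicate's boolean into the output schema, instead of using it purely as a filter as sketched here.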
[jira] [Updated] (SPARK-45580) RewritePredicateSubquery unexpectedly changes the output schema of certain queries
[ https://issues.apache.org/jira/browse/SPARK-45580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-45580: -- Description: A query can have an incorrect output schema because of a subquery. Assume this data: {noformat} create or replace temp view t1(a) as values (1), (2), (3), (7); create or replace temp view t2(c1) as values (1), (2), (3); create or replace temp view t3(col1) as values (3), (9); cache table t1; cache table t2; cache table t3; {noformat} When run in {{spark-sql}}, the following query has a superfluous boolean column: {noformat} select * from t1 where exists ( select c1 from t2 where a = c1 or a in (select col1 from t3) ); 1 false 2 false 3 true {noformat} The result should be: {noformat} 1 2 3 {noformat} When executed via the {{Dataset}} API, you don't see the incorrect result, because the Dataset API truncates the right-side of the rows based on the analyzed plan's schema (it's the optimized plan's schema that goes wrong). However, even with the {{Dataset}} API, this query goes wrong: {noformat} select ( select * from t1 where exists ( select c1 from t2 where a = c1 or a in (select col1 from t3) ) limit 1 ) from range(1); java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis at scala.Predef$.assert(Predef.scala:279) at org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88) at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276) at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275) at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) at scala.collection.AbstractIterable.foreach(Iterable.scala:933) ... 
{noformat} Other queries that have the wrong schema: {noformat} select * from t1 where a in ( select c1 from t2 where a in (select col1 from t3) ); {noformat} and {noformat} select * from t1 where not exists ( select c1 from t2 where a = c1 or a in (select col1 from t3) ); {noformat}
> RewritePredicateSubquery unexpectedly changes the output schema of certain > queries > -- > > Key: SPARK-45580 > URL: https://issues.apache.org/jira/browse/SPARK-45580 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.3, 3.4.1, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > A query can have an incorrect output schema because of a subquery. > Assume this data: > {noformat} > create or
[jira] [Created] (SPARK-45580) RewritePredicateSubquery unexpectedly changes the output schema of certain queries
Bruce Robbins created SPARK-45580: - Summary: RewritePredicateSubquery unexpectedly changes the output schema of certain queries Key: SPARK-45580 URL: https://issues.apache.org/jira/browse/SPARK-45580 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0, 3.4.1, 3.3.3 Reporter: Bruce Robbins A query can have an incorrect output schema because of a subquery. Assume this data: {noformat} create or replace temp view t1(a) as values (1), (2), (3), (7); create or replace temp view t2(c1) as values (1), (2), (3); create or replace temp view t3(col1) as values (3), (9); cache table t1; cache table t2; cache table t3; {noformat} When run in {{spark-sql}}, the following query has a superfluous boolean column: {noformat} select * from t1 where exists ( select c1 from t2 where a = c1 or a in (select col1 from t3) ); 1 false 2 false 3 true {noformat} The result should be: {noformat} 1 2 3 {noformat} When executed via the {{Dataset}} API, you don't see this result, because the Dataset API truncates the right-side of the rows based on the analyzed plan's schema (it's the optimized plan's schema that goes wrong). However, even with the {{Dataset}} API, this query goes wrong: {noformat} select ( select * from t1 where exists ( select c1 from t2 where a = c1 or a in (select col1 from t3) ) limit 1 ) from range(1); java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis at scala.Predef$.assert(Predef.scala:279) at org.apache.spark.sql.execution.ScalarSubquery.updateResult(subquery.scala:88) at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1(SparkPlan.scala:276) at org.apache.spark.sql.execution.SparkPlan.$anonfun$waitForSubqueries$1$adapted(SparkPlan.scala:275) at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:576) at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:574) at scala.collection.AbstractIterable.foreach(Iterable.scala:933) ... 
{noformat} Other queries that have the wrong schema: {noformat} select * from t1 where a in ( select c1 from t2 where a in (select col1 from t3) ); {noformat} and {noformat} select * from t1 where not exists ( select c1 from t2 where a = c1 or a in (select col1 from t3) ); {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45440) Incorrect summary counts from a CSV file
[ https://issues.apache.org/jira/browse/SPARK-45440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17772724#comment-17772724 ] Bruce Robbins commented on SPARK-45440: --- I added {{inferSchema=true}} as a datasource option in your example and I got the expected answer. Otherwise it's doing a max and min on a string (not a number). > Incorrect summary counts from a CSV file > > > Key: SPARK-45440 > URL: https://issues.apache.org/jira/browse/SPARK-45440 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.5.0 > Environment: Pyspark version 3.5.0 >Reporter: Evan Volgas >Priority: Major > Labels: aggregation, bug, pyspark > > I am using pip-installed Pyspark version 3.5.0 inside the context of an > IPython shell. The task is straightforward: take [this CSV > file|https://gist.githubusercontent.com/evanvolgas/e5cb082673ec947239658291f2251de4/raw/a9c5e9866ac662a816f9f3828a2d184032f604f0/AAPL.csv] > of AAPL stock prices and compute the minimum and maximum volume weighted > average price for the entire file. > My code is [here. > |https://gist.github.com/evanvolgas/e4aa75fec4179bb7075a5283867f127c]I've > also performed the same computation in DuckDB because I noticed that the > results of the Spark code are wrong. > Literally, the exact same SQL in DuckDB and in Spark yield different results, > and Spark's are wrong. > I have never seen this behavior in a Spark release before. I'm very confused > by it, and curious if anyone else can replicate this behavior. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
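The comment's diagnosis (string min/max rather than numeric) can be demonstrated without Spark at all. The values below are illustrative, not taken from the reporter's AAPL file:

```python
# Without inferSchema=true, a CSV column stays a string, so min/max
# compare lexicographically rather than numerically.
volumes = ["9", "10", "100"]

string_min, string_max = min(volumes), max(volumes)   # lexicographic order
numeric = [int(v) for v in volumes]                   # what inferSchema=true yields
numeric_min, numeric_max = min(numeric), max(numeric)
# Lexicographically "10" < "100" < "9", so string_min is "10" and
# string_max is "9" -- exactly the kind of wrong-looking aggregate
# the reporter observed.
```

The same mismatch explains why DuckDB (which infers numeric types from CSV by default) and Spark (which defaults to strings without `inferSchema`) return different answers for identical SQL.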
[jira] [Created] (SPARK-45171) GenerateExec fails to initialize non-deterministic expressions before use
Bruce Robbins created SPARK-45171: - Summary: GenerateExec fails to initialize non-deterministic expressions before use Key: SPARK-45171 URL: https://issues.apache.org/jira/browse/SPARK-45171 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Bruce Robbins The following query fails: {noformat} select * from explode( transform(sequence(0, cast(rand()*1000 as int) + 1), x -> x * 22) ); {noformat} The error is: {noformat} 23/09/14 09:27:25 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3) java.lang.IllegalArgumentException: requirement failed: Nondeterministic expression org.apache.spark.sql.catalyst.expressions.Rand should be initialized before eval. at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval(Expression.scala:497) at org.apache.spark.sql.catalyst.expressions.Nondeterministic.eval$(Expression.scala:495) at org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:35) at org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:543) at org.apache.spark.sql.catalyst.expressions.BinaryArithmetic.eval(arithmetic.scala:384) at org.apache.spark.sql.catalyst.expressions.Sequence.eval(collectionOperations.scala:3062) at org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval(higherOrderFunctions.scala:275) at org.apache.spark.sql.catalyst.expressions.SimpleHigherOrderFunction.eval$(higherOrderFunctions.scala:274) at org.apache.spark.sql.catalyst.expressions.ArrayTransform.eval(higherOrderFunctions.scala:308) at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:375) at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) ... 
{noformat} However, this query succeeds: {noformat} select * from explode( sequence(0, cast(rand()*1000 as int) + 1) ); {noformat} The difference is that {{transform}} turns off whole-stage codegen, which exposes a bug in {{GenerateExec}} where the non-deterministic expression passed to the generator function is not initialized before being used. An even simpler repro case is: {noformat} set spark.sql.codegen.wholeStage=false; select explode(array(rand())); {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
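The initialize-before-eval contract that the stack trace enforces can be sketched minimally in Python (the class and names are illustrative, not Spark's actual implementation):

```python
import random

class RandExpr:
    """Minimal sketch of the Nondeterministic contract seen in the
    stack trace: eval() must fail unless initialize() has run first
    to seed the expression for the current partition. Skipping that
    initialize step is the GenerateExec bug described above."""
    def __init__(self):
        self._rng = None

    def initialize(self, partition_index):
        # Seed per partition so each task gets a reproducible stream.
        self._rng = random.Random(partition_index)

    def eval(self):
        if self._rng is None:
            raise ValueError(
                "Nondeterministic expression should be initialized before eval")
        return self._rng.random()

expr = RandExpr()
expr.initialize(0)   # the step GenerateExec skipped
value = expr.eval()  # safe only after initialize()
```

Calling `eval()` on a fresh instance without `initialize()` raises the same class of "should be initialized before eval" error shown in the report.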
[jira] [Commented] (SPARK-44912) Spark 3.4 multi-column sum slows with many columns
[ https://issues.apache.org/jira/browse/SPARK-44912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763455#comment-17763455 ] Bruce Robbins commented on SPARK-44912: --- It looks like this was fixed with SPARK-45071. Your issue was reported earlier, but missed somehow. > Spark 3.4 multi-column sum slows with many columns > -- > > Key: SPARK-44912 > URL: https://issues.apache.org/jira/browse/SPARK-44912 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0, 3.4.1 >Reporter: Brady Bickel >Priority: Major > > The code below is a minimal reproducible example of an issue I discovered > with Pyspark 3.4.x. I want to sum the values of multiple columns and put the > sum of those columns (per row) into a new column. This code works and returns > in a reasonable amount of time in Pyspark 3.3.x, but is extremely slow in > Pyspark 3.4.x when the number of columns grows. See below for execution > timing summary as N varies. > {code:java} > import pyspark.sql.functions as F > import random > import string > from functools import reduce > from operator import add > from pyspark.sql import SparkSession > spark = SparkSession.builder.getOrCreate() > # generate a dataframe N columns by M rows with random 8 digit column > # names and random integers in [-5,10] > N = 30 > M = 100 > columns = [''.join(random.choices(string.ascii_uppercase + > string.digits, k=8)) >for _ in range(N)] > data = [tuple([random.randint(-5,10) for _ in range(N)]) > for _ in range(M)] > df = spark.sparkContext.parallelize(data).toDF(columns) > # 3 ways to add a sum column, all of them slow for high N in spark 3.4 > df = df.withColumn("col_sum1", sum(df[col] for col in columns)) > df = df.withColumn("col_sum2", reduce(add, [F.col(col) for col in columns])) > df = df.withColumn("col_sum3", F.expr("+".join(columns))) {code} > Timing results for Spark 3.3: > ||N||Exe Time (s)|| > |5|0.514| > |10|0.248| > |15|0.327| > |20|0.403| > |25|0.279| > |30|0.322| > 
|50|0.430| > Timing results for Spark 3.4: > ||N||Exe Time (s)|| > |5|0.379| > |10|0.318| > |15|0.405| > |20|1.32| > |25|28.8| > |30|448| > |50|>1 (did not finish)| -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
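All three forms in the report build the same left-nested chain of Add expressions; the sketch below (plain Python, no Spark) shows that shape, whose depth grows linearly with N. A deeply nested tree like this is what the 3.4 regression (fixed by SPARK-45071) made expensive to analyze:

```python
from functools import reduce
from operator import add

columns = ["c1", "c2", "c3", "c4"]

# Rendering the fold as a string exposes the left-nested expression
# tree that reduce(add, ...) hands to Catalyst: depth N for N columns.
nested = reduce(lambda a, b: f"({a} + {b})", columns)

# The per-row arithmetic itself is trivial; the cost was in analysis.
values = {"c1": 1, "c2": 2, "c3": 3, "c4": 4}
row_sum = reduce(add, (values[c] for c in columns))
```

For `N = 30` the same fold yields a 30-deep tree, which matches the timing table: roughly flat until N ≈ 20, then a sharp blow-up.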
[jira] [Updated] (SPARK-45106) percentile_cont gets internal error when user input fails runtime replacement's input type check
[ https://issues.apache.org/jira/browse/SPARK-45106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-45106: -- Affects Version/s: 3.3.2 > percentile_cont gets internal error when user input fails runtime > replacement's input type check > - > > Key: SPARK-45106 > URL: https://issues.apache.org/jira/browse/SPARK-45106 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0 >Reporter: Bruce Robbins >Priority: Major > Labels: pull-request-available > > This query throws an internal error rather than producing a useful error > message: > {noformat} > select percentile_cont(b) WITHIN GROUP (ORDER BY a DESC) as x > from (values (12, 0.25), (13, 0.25), (22, 0.25)) as (a, b); > [INTERNAL_ERROR] Cannot resolve the runtime replaceable expression > "percentile_cont(a, b)". The replacement is unresolved: "percentile(a, b, 1)". > org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot resolve the runtime > replaceable expression "percentile_cont(a, b)". The replacement is > unresolved: "percentile(a, b, 1)". > at > org.apache.spark.SparkException$.internalError(SparkException.scala:92) > at > org.apache.spark.SparkException$.internalError(SparkException.scala:96) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:313) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:277) > ... > {noformat} > It should instead inform the user that the input expression must be foldable. > {{PercentileCont}} does not check the user's input. If the runtime > replacement (an instance of {{Percentile}}) rejects the user's input, the > runtime replacement ends up unresolved. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45106) percentile_cont gets internal error when user input fails runtime replacement's input type check
Bruce Robbins created SPARK-45106: - Summary: percentile_cont gets internal error when user input fails runtime replacement's input type check Key: SPARK-45106 URL: https://issues.apache.org/jira/browse/SPARK-45106 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.1, 3.5.0, 4.0.0 Reporter: Bruce Robbins This query throws an internal error rather than producing a useful error message: {noformat} select percentile_cont(b) WITHIN GROUP (ORDER BY a DESC) as x from (values (12, 0.25), (13, 0.25), (22, 0.25)) as (a, b); [INTERNAL_ERROR] Cannot resolve the runtime replaceable expression "percentile_cont(a, b)". The replacement is unresolved: "percentile(a, b, 1)". org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot resolve the runtime replaceable expression "percentile_cont(a, b)". The replacement is unresolved: "percentile(a, b, 1)". at org.apache.spark.SparkException$.internalError(SparkException.scala:92) at org.apache.spark.SparkException$.internalError(SparkException.scala:96) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:313) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:277) ... {noformat} It should instead inform the user that the input expression must be foldable. {{PercentileCont}} does not check the user's input. If the runtime replacement (an instance of {{Percentile}}) rejects the user's input, the runtime replacement ends up unresolved. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
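For context, PERCENTILE_CONT with a constant (foldable) fraction computes a linearly interpolated percentile; the sketch below shows that computation in plain Python (the helper is illustrative, not Spark's implementation). The report's query fails precisely because it passes the per-row column `b` as the fraction instead of a constant:

```python
def percentile_cont(values, fraction, descending=False):
    """Linearly interpolated percentile, as PERCENTILE_CONT defines it.
    `fraction` must be a constant in [0, 1] -- the foldable input that
    Percentile's type check requires."""
    ordered = sorted(values, reverse=descending)
    pos = fraction * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

# Valid usage over the report's column a = (12, 13, 22):
result_asc = percentile_cont([12, 13, 22], 0.25)
result_desc = percentile_cont([12, 13, 22], 0.25, descending=True)
```

With a constant fraction the query resolves; with the non-foldable column `b`, Spark should report the bad input rather than fail with the internal error shown above.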
[jira] [Updated] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true
[ https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-44805: -- Affects Version/s: 3.4.1 > Data lost after union using > spark.sql.parquet.enableNestedColumnVectorizedReader=true > - > > Key: SPARK-44805 > URL: https://issues.apache.org/jira/browse/SPARK-44805 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.1, 3.4.1 > Environment: pySpark, linux, hadoop, parquet. >Reporter: Jakub Wozniak >Priority: Major > Labels: correctness > > When union-ing two DataFrames read from parquet containing nested structures > (2 fields of array types where one is double and second is integer) data from > the second field seems to be lost (zeros are set instead). > This seems to be the case only if nested vectorised reader is used > (spark.sql.parquet.enableNestedColumnVectorizedReader=true). > The following Python code reproduces the problem: > {code:java} > from pyspark.sql import SparkSession > from pyspark.sql.types import * > # PREPARING DATA > data1 = [] > data2 = [] > for i in range(2): > data1.append( (([1,2,3],[1,1,2]),i)) > data2.append( (([1.0,2.0,3.0],[1,1]),i+10)) > schema1 = StructType([ > StructField('value', StructType([ > StructField('f1', ArrayType(IntegerType()), True), > StructField('f2', ArrayType(IntegerType()), True) > ])), > StructField('id', IntegerType(), True) > ]) > schema2 = StructType([ > StructField('value', StructType([ > StructField('f1', ArrayType(DoubleType()), True), > StructField('f2', ArrayType(IntegerType()), True) > ])), > StructField('id', IntegerType(), True) > ]) > spark = SparkSession.builder.getOrCreate() > data_dir = "/user//" > df1 = spark.createDataFrame(data1, schema1) > df1.write.mode('overwrite').parquet(data_dir + "data1") > df2 = spark.createDataFrame(data2, schema2) > df2.write.mode('overwrite').parquet(data_dir + "data2") > # READING DATA > parquet1 = spark.read.parquet(data_dir + "data1") > parquet2 = 
spark.read.parquet(data_dir + "data2") > # UNION > out = parquet1.union(parquet2) > parquet1.select("value.f2").distinct().show() > out.select("value.f2").distinct().show() > print(parquet1.collect()) > print(out.collect()) {code} > Output: > {code:java}
> +---------+
> |       f2|
> +---------+
> |[1, 1, 2]|
> +---------+
>
> +---------+
> |       f2|
> +---------+
> |[0, 0, 0]|
> |   [1, 1]|
> +---------+
> [
>  Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0),
>  Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1)
> ]
> [
>  Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0),
>  Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1),
>  Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10),
>  Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11)
> ]
> {code} > Please notice that values for the field f2 are lost after the union is done. > This only happens when this data is read from parquet files. > Could you please look into this? > Best regards, > Jakub -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true
[ https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762792#comment-17762792 ] Bruce Robbins commented on SPARK-44805: --- PR here: https://github.com/apache/spark/pull/42850 > Data lost after union using > spark.sql.parquet.enableNestedColumnVectorizedReader=true > - > > Key: SPARK-44805 > URL: https://issues.apache.org/jira/browse/SPARK-44805 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.1 > Environment: pySpark, linux, hadoop, parquet. >Reporter: Jakub Wozniak >Priority: Major > Labels: correctness > > When union-ing two DataFrames read from parquet containing nested structures > (2 fields of array types where one is double and second is integer) data from > the second field seems to be lost (zeros are set instead). > This seems to be the case only if nested vectorised reader is used > (spark.sql.parquet.enableNestedColumnVectorizedReader=true). > The following Python code reproduces the problem: > {code:java} > from pyspark.sql import SparkSession > from pyspark.sql.types import * > # PREPARING DATA > data1 = [] > data2 = [] > for i in range(2): > data1.append( (([1,2,3],[1,1,2]),i)) > data2.append( (([1.0,2.0,3.0],[1,1]),i+10)) > schema1 = StructType([ > StructField('value', StructType([ > StructField('f1', ArrayType(IntegerType()), True), > StructField('f2', ArrayType(IntegerType()), True) > ])), > StructField('id', IntegerType(), True) > ]) > schema2 = StructType([ > StructField('value', StructType([ > StructField('f1', ArrayType(DoubleType()), True), > StructField('f2', ArrayType(IntegerType()), True) > ])), > StructField('id', IntegerType(), True) > ]) > spark = SparkSession.builder.getOrCreate() > data_dir = "/user//" > df1 = spark.createDataFrame(data1, schema1) > df1.write.mode('overwrite').parquet(data_dir + "data1") > df2 = spark.createDataFrame(data2, schema2) > df2.write.mode('overwrite').parquet(data_dir + "data2") > # READING DATA > parquet1 
= spark.read.parquet(data_dir + "data1") > parquet2 = spark.read.parquet(data_dir + "data2") > # UNION > out = parquet1.union(parquet2) > parquet1.select("value.f2").distinct().show() > out.select("value.f2").distinct().show() > print(parquet1.collect()) > print(out.collect()) {code} > Output: > {code:java} > +-+ > | f2| > +-+ > |[1, 1, 2]| > +-+ > +-+ > | f2| > +-+ > |[0, 0, 0]| > | [1, 1]| > +-+ > [ > Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), > Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1) > ] > [ > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11) > ] {code} > Please notice that values for the field f2 are lost after the union is done. > This only happens when this data is read from parquet files. > Could you please look into this? > Best regards, > Jakub -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true
[ https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762234#comment-17762234 ] Bruce Robbins commented on SPARK-44805: --- I looked at this yesterday and I think I have a handle on what's going on. I will make a PR in the coming days. > Data lost after union using > spark.sql.parquet.enableNestedColumnVectorizedReader=true > - > > Key: SPARK-44805 > URL: https://issues.apache.org/jira/browse/SPARK-44805 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.1 > Environment: pySpark, linux, hadoop, parquet. >Reporter: Jakub Wozniak >Priority: Major > Labels: correctness > > When union-ing two DataFrames read from parquet containing nested structures > (2 fields of array types where one is double and second is integer) data from > the second field seems to be lost (zeros are set instead). > This seems to be the case only if nested vectorised reader is used > (spark.sql.parquet.enableNestedColumnVectorizedReader=true). 
> The following Python code reproduces the problem: > {code:java} > from pyspark.sql import SparkSession > from pyspark.sql.types import * > # PREPARING DATA > data1 = [] > data2 = [] > for i in range(2): > data1.append( (([1,2,3],[1,1,2]),i)) > data2.append( (([1.0,2.0,3.0],[1,1]),i+10)) > schema1 = StructType([ > StructField('value', StructType([ > StructField('f1', ArrayType(IntegerType()), True), > StructField('f2', ArrayType(IntegerType()), True) > ])), > StructField('id', IntegerType(), True) > ]) > schema2 = StructType([ > StructField('value', StructType([ > StructField('f1', ArrayType(DoubleType()), True), > StructField('f2', ArrayType(IntegerType()), True) > ])), > StructField('id', IntegerType(), True) > ]) > spark = SparkSession.builder.getOrCreate() > data_dir = "/user//" > df1 = spark.createDataFrame(data1, schema1) > df1.write.mode('overwrite').parquet(data_dir + "data1") > df2 = spark.createDataFrame(data2, schema2) > df2.write.mode('overwrite').parquet(data_dir + "data2") > # READING DATA > parquet1 = spark.read.parquet(data_dir + "data1") > parquet2 = spark.read.parquet(data_dir + "data2") > # UNION > out = parquet1.union(parquet2) > parquet1.select("value.f2").distinct().show() > out.select("value.f2").distinct().show() > print(parquet1.collect()) > print(out.collect()) {code} > Output: > {code:java} > +-+ > | f2| > +-+ > |[1, 1, 2]| > +-+ > +-+ > | f2| > +-+ > |[0, 0, 0]| > | [1, 1]| > +-+ > [ > Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), > Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1) > ] > [ > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11) > ] {code} > Please notice that values for the field f2 are lost after the union is done. > This only happens when this data is read from parquet files. > Could you please look into this? 
> Best regards, > Jakub -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true
[ https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-44805: -- Labels: correctness (was: ) > Data lost after union using > spark.sql.parquet.enableNestedColumnVectorizedReader=true > - > > Key: SPARK-44805 > URL: https://issues.apache.org/jira/browse/SPARK-44805 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.1 > Environment: pySpark, linux, hadoop, parquet. >Reporter: Jakub Wozniak >Priority: Major > Labels: correctness > > When union-ing two DataFrames read from parquet containing nested structures > (2 fields of array types where one is double and second is integer) data from > the second field seems to be lost (zeros are set instead). > This seems to be the case only if nested vectorised reader is used > (spark.sql.parquet.enableNestedColumnVectorizedReader=true). > The following Python code reproduces the problem: > {code:java} > from pyspark.sql import SparkSession > from pyspark.sql.types import * > # PREPARING DATA > data1 = [] > data2 = [] > for i in range(2): > data1.append( (([1,2,3],[1,1,2]),i)) > data2.append( (([1.0,2.0,3.0],[1,1]),i+10)) > schema1 = StructType([ > StructField('value', StructType([ > StructField('f1', ArrayType(IntegerType()), True), > StructField('f2', ArrayType(IntegerType()), True) > ])), > StructField('id', IntegerType(), True) > ]) > schema2 = StructType([ > StructField('value', StructType([ > StructField('f1', ArrayType(DoubleType()), True), > StructField('f2', ArrayType(IntegerType()), True) > ])), > StructField('id', IntegerType(), True) > ]) > spark = SparkSession.builder.getOrCreate() > data_dir = "/user//" > df1 = spark.createDataFrame(data1, schema1) > df1.write.mode('overwrite').parquet(data_dir + "data1") > df2 = spark.createDataFrame(data2, schema2) > df2.write.mode('overwrite').parquet(data_dir + "data2") > # READING DATA > parquet1 = spark.read.parquet(data_dir + "data1") > parquet2 = 
spark.read.parquet(data_dir + "data2") > # UNION > out = parquet1.union(parquet2) > parquet1.select("value.f2").distinct().show() > out.select("value.f2").distinct().show() > print(parquet1.collect()) > print(out.collect()) {code} > Output: > {code:java} > +---------+ > |       f2| > +---------+ > |[1, 1, 2]| > +---------+ > +---------+ > |       f2| > +---------+ > |[0, 0, 0]| > |   [1, 1]| > +---------+ > [ > Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), > Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1) > ] > [ > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11) > ] {code} > Please notice that values for the field f2 are lost after the union is done. > This only happens when this data is read from parquet files. > Could you please look into this? > Best regards, > Jakub -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
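The union in the repro above should only widen f1 (int to double) and pass f2 through untouched. A pure-Python sketch of the intended semantics of the implicit Cast that union() inserts when the two branches disagree on a nested field's type (widen_struct is a hypothetical helper, not a Spark API):

```python
# Hypothetical model of the Cast inserted by union() for mismatched nested
# schemas: f1 is widened array<int> -> array<double>; f2 already matches
# on both sides, so its values must pass through unchanged.
def widen_struct(value):
    f1, f2 = value
    return ([float(x) for x in f1], list(f2))

# Same rows as the reporter's data1/data2.
rows1 = [(([1, 2, 3], [1, 1, 2]), 0), (([1, 2, 3], [1, 1, 2]), 1)]
rows2 = [(([1.0, 2.0, 3.0], [1, 1]), 10), (([1.0, 2.0, 3.0], [1, 1]), 11)]

# The first branch is cast, the second is taken as-is.
unioned = [(widen_struct(v), i) for v, i in rows1] + rows2
```

Under these semantics the f2 values [1, 1, 2] survive the union; the zeros in the reported output come from the read-plus-cast path when the nested vectorized reader is enabled, not from the cast logic as such.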
[jira] [Comment Edited] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true
[ https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754344#comment-17754344 ] Bruce Robbins edited comment on SPARK-44805 at 8/15/23 12:26 AM: - [~sunchao] It seems to be some weird interaction between Parquet nested vectorization and the {{Cast}} expression: {noformat} drop table if exists t1; create table t1 using parquet as select * from values (named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2))) as (value); select value from t1; {"f1":[1,2,3],"f2":[1,1,2]} <== this is expected Time taken: 0.126 seconds, Fetched 1 row(s) select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1; {"f1":[1.0,2.0,3.0],"f2":[0,0,0]} <== this is not expected Time taken: 0.102 seconds, Fetched 1 row(s) set spark.sql.parquet.enableNestedColumnVectorizedReader=false; select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1; {"f1":[1.0,2.0,3.0],"f2":[1,1,2]} <== now has expected value Time taken: 0.244 seconds, Fetched 1 row(s) {noformat} The union operation adds this {{Cast}} expression because {{value}} has different datatypes between your two dataframes. was (Author: bersprockets): It seems to be some weird interaction between Parquet and the {{Cast}} expression: {noformat} drop table if exists t1; create table t1 using parquet as select * from values (named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2))) as (value); select value from t1; {"f1":[1,2,3],"f2":[1,1,2]} <== this is expected Time taken: 0.126 seconds, Fetched 1 row(s) select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1; {"f1":[1.0,2.0,3.0],"f2":[0,0,0]} <== this is not expected Time taken: 0.102 seconds, Fetched 1 row(s) {noformat} The union operation adds this {{Cast}} expression because {{value}} has different datatypes between your two dataframes. 
> Data lost after union using > spark.sql.parquet.enableNestedColumnVectorizedReader=true > - > > Key: SPARK-44805 > URL: https://issues.apache.org/jira/browse/SPARK-44805 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.1 > Environment: pySpark, linux, hadoop, parquet. >Reporter: Jakub Wozniak >Priority: Major > > When union-ing two DataFrames read from parquet containing nested structures > (2 fields of array types where one is double and second is integer) data from > the second field seems to be lost (zeros are set instead). > This seems to be the case only if nested vectorised reader is used > (spark.sql.parquet.enableNestedColumnVectorizedReader=true). > The following Python code reproduces the problem: > {code:java} > from pyspark.sql import SparkSession > from pyspark.sql.types import * > # PREPARING DATA > data1 = [] > data2 = [] > for i in range(2): > data1.append( (([1,2,3],[1,1,2]),i)) > data2.append( (([1.0,2.0,3.0],[1,1]),i+10)) > schema1 = StructType([ > StructField('value', StructType([ > StructField('f1', ArrayType(IntegerType()), True), > StructField('f2', ArrayType(IntegerType()), True) > ])), > StructField('id', IntegerType(), True) > ]) > schema2 = StructType([ > StructField('value', StructType([ > StructField('f1', ArrayType(DoubleType()), True), > StructField('f2', ArrayType(IntegerType()), True) > ])), > StructField('id', IntegerType(), True) > ]) > spark = SparkSession.builder.getOrCreate() > data_dir = "/user//" > df1 = spark.createDataFrame(data1, schema1) > df1.write.mode('overwrite').parquet(data_dir + "data1") > df2 = spark.createDataFrame(data2, schema2) > df2.write.mode('overwrite').parquet(data_dir + "data2") > # READING DATA > parquet1 = spark.read.parquet(data_dir + "data1") > parquet2 = spark.read.parquet(data_dir + "data2") > # UNION > out = parquet1.union(parquet2) > parquet1.select("value.f2").distinct().show() > out.select("value.f2").distinct().show() > print(parquet1.collect()) > 
print(out.collect()) {code} > Output: > {code:java} > +---------+ > |       f2| > +---------+ > |[1, 1, 2]| > +---------+ > +---------+ > |       f2| > +---------+ > |[0, 0, 0]| > |   [1, 1]| > +---------+ > [ > Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), > Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1) > ] > [ > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11) > ] {code} > Please notice that values for the field f2 are lost after the union is done. > This only happens when this data is read from parquet files. > Could you please look into this? > Best
[jira] [Commented] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true
[ https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17754344#comment-17754344 ] Bruce Robbins commented on SPARK-44805: --- It seems to be some weird interaction between Parquet and the {{Cast}} expression: {noformat} drop table if exists t1; create table t1 using parquet as select * from values (named_struct('f1', array(1, 2, 3), 'f2', array(1, 1, 2))) as (value); select value from t1; {"f1":[1,2,3],"f2":[1,1,2]} <== this is expected Time taken: 0.126 seconds, Fetched 1 row(s) select cast(value as struct<f1:array<double>,f2:array<int>>) AS value from t1; {"f1":[1.0,2.0,3.0],"f2":[0,0,0]} <== this is not expected Time taken: 0.102 seconds, Fetched 1 row(s) {noformat} The union operation adds this {{Cast}} expression because {{value}} has different datatypes between your two dataframes. > Data lost after union using > spark.sql.parquet.enableNestedColumnVectorizedReader=true > - > > Key: SPARK-44805 > URL: https://issues.apache.org/jira/browse/SPARK-44805 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.1 > Environment: pySpark, linux, hadoop, parquet. >Reporter: Jakub Wozniak >Priority: Major > > When union-ing two DataFrames read from parquet containing nested structures > (2 fields of array types where one is double and second is integer) data from > the second field seems to be lost (zeros are set instead). > This seems to be the case only if nested vectorised reader is used > (spark.sql.parquet.enableNestedColumnVectorizedReader=true). 
> The following Python code reproduces the problem: > {code:java} > from pyspark.sql import SparkSession > from pyspark.sql.types import * > # PREPARING DATA > data1 = [] > data2 = [] > for i in range(2): > data1.append( (([1,2,3],[1,1,2]),i)) > data2.append( (([1.0,2.0,3.0],[1,1]),i+10)) > schema1 = StructType([ > StructField('value', StructType([ > StructField('f1', ArrayType(IntegerType()), True), > StructField('f2', ArrayType(IntegerType()), True) > ])), > StructField('id', IntegerType(), True) > ]) > schema2 = StructType([ > StructField('value', StructType([ > StructField('f1', ArrayType(DoubleType()), True), > StructField('f2', ArrayType(IntegerType()), True) > ])), > StructField('id', IntegerType(), True) > ]) > spark = SparkSession.builder.getOrCreate() > data_dir = "/user//" > df1 = spark.createDataFrame(data1, schema1) > df1.write.mode('overwrite').parquet(data_dir + "data1") > df2 = spark.createDataFrame(data2, schema2) > df2.write.mode('overwrite').parquet(data_dir + "data2") > # READING DATA > parquet1 = spark.read.parquet(data_dir + "data1") > parquet2 = spark.read.parquet(data_dir + "data2") > # UNION > out = parquet1.union(parquet2) > parquet1.select("value.f2").distinct().show() > out.select("value.f2").distinct().show() > print(parquet1.collect()) > print(out.collect()) {code} > Output: > {code:java} > +---------+ > |       f2| > +---------+ > |[1, 1, 2]| > +---------+ > +---------+ > |       f2| > +---------+ > |[0, 0, 0]| > |   [1, 1]| > +---------+ > [ > Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), > Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1) > ] > [ > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), > Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11) > ] {code} > Please notice that values for the field f2 are lost after the union is done. > This only happens when this data is read from parquet files. > Could you please look into this? 
> Best regards, > Jakub
[jira] [Commented] (SPARK-44477) CheckAnalysis uses error subclass as an error class
[ https://issues.apache.org/jira/browse/SPARK-44477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744314#comment-17744314 ] Bruce Robbins commented on SPARK-44477: --- PR here: https://github.com/apache/spark/pull/42064 > CheckAnalysis uses error subclass as an error class > --- > > Key: SPARK-44477 > URL: https://issues.apache.org/jira/browse/SPARK-44477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Bruce Robbins >Priority: Minor > > {{CheckAnalysis}} treats {{TYPE_CHECK_FAILURE_WITH_HINT}} as an error class, > but it is instead an error subclass of {{{}DATATYPE_MISMATCH{}}}. > {noformat} > spark-sql (default)> select bitmap_count(12); > [INTERNAL_ERROR] Cannot find main error class 'TYPE_CHECK_FAILURE_WITH_HINT' > org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot find main error > class 'TYPE_CHECK_FAILURE_WITH_HINT' > at org.apache.spark.SparkException$.internalError(SparkException.scala:83) > at org.apache.spark.SparkException$.internalError(SparkException.scala:87) > at > org.apache.spark.ErrorClassesJsonReader.$anonfun$getMessageTemplate$1(ErrorClassesJSONReader.scala:68) > at scala.collection.immutable.HashMap$HashMap1.getOrElse0(HashMap.scala:361) > at > scala.collection.immutable.HashMap$HashTrieMap.getOrElse0(HashMap.scala:594) > at > scala.collection.immutable.HashMap$HashTrieMap.getOrElse0(HashMap.scala:589) > at scala.collection.immutable.HashMap.getOrElse(HashMap.scala:73) > {noformat} > This issue only occurs when an expression uses > {{TypeCheckResult.TypeCheckFailure}} to indicate input type check failure. > {{TypeCheckResult.TypeCheckFailure}} appears to be deprecated in favor of > {{{}TypeCheckResult.DataTypeMismatch{}}}, but recently two expressions were > added that use {{{}TypeCheckResult.TypeCheckFailure{}}}: {{BitmapCount}} and > {{{}BitmapOrAgg{}}}. > {{BitmapCount}} and {{BitmapOrAgg}} should probably be fixed to use > {{{}TypeCheckResult.DataTypeMismatch{}}}. 
Regardless, the code in > {{CheckAnalysis}} that handles {{TypeCheckResult.TypeCheckFailure}} should be > corrected (or removed).
[jira] [Created] (SPARK-44477) CheckAnalysis uses error subclass as an error class
Bruce Robbins created SPARK-44477: - Summary: CheckAnalysis uses error subclass as an error class Key: SPARK-44477 URL: https://issues.apache.org/jira/browse/SPARK-44477 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Bruce Robbins {{CheckAnalysis}} treats {{TYPE_CHECK_FAILURE_WITH_HINT}} as an error class, but it is instead an error subclass of {{{}DATATYPE_MISMATCH{}}}. {noformat} spark-sql (default)> select bitmap_count(12); [INTERNAL_ERROR] Cannot find main error class 'TYPE_CHECK_FAILURE_WITH_HINT' org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot find main error class 'TYPE_CHECK_FAILURE_WITH_HINT' at org.apache.spark.SparkException$.internalError(SparkException.scala:83) at org.apache.spark.SparkException$.internalError(SparkException.scala:87) at org.apache.spark.ErrorClassesJsonReader.$anonfun$getMessageTemplate$1(ErrorClassesJSONReader.scala:68) at scala.collection.immutable.HashMap$HashMap1.getOrElse0(HashMap.scala:361) at scala.collection.immutable.HashMap$HashTrieMap.getOrElse0(HashMap.scala:594) at scala.collection.immutable.HashMap$HashTrieMap.getOrElse0(HashMap.scala:589) at scala.collection.immutable.HashMap.getOrElse(HashMap.scala:73) {noformat} This issue only occurs when an expression uses {{TypeCheckResult.TypeCheckFailure}} to indicate input type check failure. {{TypeCheckResult.TypeCheckFailure}} appears to be deprecated in favor of {{{}TypeCheckResult.DataTypeMismatch{}}}, but recently two expressions were added that use {{{}TypeCheckResult.TypeCheckFailure{}}}: {{BitmapCount}} and {{{}BitmapOrAgg{}}}. {{BitmapCount}} and {{BitmapOrAgg}} should probably be fixed to use {{{}TypeCheckResult.DataTypeMismatch{}}}. Regardless, the code in {{CheckAnalysis}} that handles {{TypeCheckResult.TypeCheckFailure}} should be corrected (or removed). 
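The lookup failure above can be modeled with a tiny registry: error subclasses live under a main error class, so passing a bare subclass name where a main class is expected cannot resolve. A simplified Python sketch (the registry shape and function name are illustrative, not Spark's actual error-classes JSON or API):

```python
# Illustrative registry: TYPE_CHECK_FAILURE_WITH_HINT is only reachable
# as a subclass of DATATYPE_MISMATCH, never as a main class.
ERROR_CLASSES = {
    "DATATYPE_MISMATCH": {
        "subClass": {
            "TYPE_CHECK_FAILURE_WITH_HINT": "{msg}{hint}",
        },
    },
}

def get_message_template(error_class: str) -> str:
    # Qualified names look like "MAIN.SUBCLASS"; the part before the dot
    # must be a registered main error class.
    main, _, sub = error_class.partition(".")
    if main not in ERROR_CLASSES:
        raise KeyError(f"[INTERNAL_ERROR] Cannot find main error class '{main}'")
    entry = ERROR_CLASSES[main]
    return entry["subClass"][sub] if sub else entry.get("template", "")

# Correct usage qualifies the subclass with its main class:
ok = get_message_template("DATATYPE_MISMATCH.TYPE_CHECK_FAILURE_WITH_HINT")
```

Calling get_message_template("TYPE_CHECK_FAILURE_WITH_HINT") raises the internal-error KeyError, mirroring the stack trace in the report.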
[jira] [Updated] (SPARK-44251) Potential for incorrect results or NPE when full outer USING join has null key value
[ https://issues.apache.org/jira/browse/SPARK-44251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-44251: -- Labels: correctness (was: ) > Potential for incorrect results or NPE when full outer USING join has null > key value > > > Key: SPARK-44251 > URL: https://issues.apache.org/jira/browse/SPARK-44251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.1, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness > > The following query produces incorrect results: > {noformat} > create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2); > create or replace temp view v2 as values (2, 3) as (c1, c2); > select explode(array(c1)) as x > from v1 > full outer join v2 > using (c1); > -1 <== should be null > 1 > 2 > {noformat} > The following query fails with a {{NullPointerException}}: > {noformat} > create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2); > create or replace temp view v2 as values ('2', 3) as (c1, c2); > select explode(array(c1)) as x > from v1 > full outer join v2 > using (c1); > 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11) > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown > Source) > at > 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43) > ... > {noformat}
[jira] [Updated] (SPARK-44251) Potential for incorrect results or NPE when full outer USING join has null key value
[ https://issues.apache.org/jira/browse/SPARK-44251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-44251: -- Affects Version/s: 3.3.2 > Potential for incorrect results or NPE when full outer USING join has null > key value > > > Key: SPARK-44251 > URL: https://issues.apache.org/jira/browse/SPARK-44251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.1, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > The following query produces incorrect results: > {noformat} > create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2); > create or replace temp view v2 as values (2, 3) as (c1, c2); > select explode(array(c1)) as x > from v1 > full outer join v2 > using (c1); > -1 <== should be null > 1 > 2 > {noformat} > The following query fails with a {{NullPointerException}}: > {noformat} > create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2); > create or replace temp view v2 as values ('2', 3) as (c1, c2); > select explode(array(c1)) as x > from v1 > full outer join v2 > using (c1); > 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11) > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > 
org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43) > ... > {noformat}
[jira] [Updated] (SPARK-44251) Potential for incorrect results or NPE when full outer USING join has null key value
[ https://issues.apache.org/jira/browse/SPARK-44251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-44251: -- Affects Version/s: 3.4.1 > Potential for incorrect results or NPE when full outer USING join has null > key value > > > Key: SPARK-44251 > URL: https://issues.apache.org/jira/browse/SPARK-44251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > The following query produces incorrect results: > {noformat} > create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2); > create or replace temp view v2 as values (2, 3) as (c1, c2); > select explode(array(c1)) as x > from v1 > full outer join v2 > using (c1); > -1 <== should be null > 1 > 2 > {noformat} > The following query fails with a {{NullPointerException}}: > {noformat} > create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2); > create or replace temp view v2 as values ('2', 3) as (c1, c2); > select explode(array(c1)) as x > from v1 > full outer join v2 > using (c1); > 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11) > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > 
org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43) > ... > {noformat}
[jira] [Commented] (SPARK-44251) Potential for incorrect results or NPE when full outer USING join has null key value
[ https://issues.apache.org/jira/browse/SPARK-44251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17739180#comment-17739180 ] Bruce Robbins commented on SPARK-44251: --- PR can be found here: https://github.com/apache/spark/pull/41809 > Potential for incorrect results or NPE when full outer USING join has null > key value > > > Key: SPARK-44251 > URL: https://issues.apache.org/jira/browse/SPARK-44251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > The following query produces incorrect results: > {noformat} > create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2); > create or replace temp view v2 as values (2, 3) as (c1, c2); > select explode(array(c1)) as x > from v1 > full outer join v2 > using (c1); > -1 <== should be null > 1 > 2 > {noformat} > The following query fails with a {{NullPointerException}}: > {noformat} > create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2); > create or replace temp view v2 as values ('2', 3) as (c1, c2); > select explode(array(c1)) as x > from v1 > full outer join v2 > using (c1); > 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11) > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown > Source) > at > 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43) > ... > {noformat}
[jira] [Commented] (SPARK-44251) Potential for incorrect results or NPE when full outer USING join has null key value
[ https://issues.apache.org/jira/browse/SPARK-44251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17738762#comment-17738762 ] Bruce Robbins commented on SPARK-44251: --- This is similar to, but not quite the same as SPARK-43718, and the fix will be similar too. I will make a PR shortly. > Potential for incorrect results or NPE when full outer USING join has null > key value > > > Key: SPARK-44251 > URL: https://issues.apache.org/jira/browse/SPARK-44251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > The following query produces incorrect results: > {noformat} > create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2); > create or replace temp view v2 as values (2, 3) as (c1, c2); > select explode(array(c1)) as x > from v1 > full outer join v2 > using (c1); > -1 <== should be null > 1 > 2 > {noformat} > The following query fails with a {{NullPointerException}}: > {noformat} > create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2); > create or replace temp view v2 as values ('2', 3) as (c1, c2); > select explode(array(c1)) as x > from v1 > full outer join v2 > using (c1); > 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11) > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown > Source) 
> at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43) > ... > {noformat}
[jira] [Updated] (SPARK-44251) Potential for incorrect results or NPE when full outer USING join has null key value
[ https://issues.apache.org/jira/browse/SPARK-44251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-44251: -- Summary: Potential for incorrect results or NPE when full outer USING join has null key value (was: Potentially incorrect results or NPE when full outer USING join has null key value) > Potential for incorrect results or NPE when full outer USING join has null > key value > > > Key: SPARK-44251 > URL: https://issues.apache.org/jira/browse/SPARK-44251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > The following query produces incorrect results: > {noformat} > create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2); > create or replace temp view v2 as values (2, 3) as (c1, c2); > select explode(array(c1)) as x > from v1 > full outer join v2 > using (c1); > -1 <== should be null > 1 > 2 > {noformat} > The following query fails with a {{NullPointerException}}: > {noformat} > create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2); > create or replace temp view v2 as values ('2', 3) as (c1, c2); > select explode(array(c1)) as x > from v1 > full outer join v2 > using (c1); > 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11) > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown > Source) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43) > ... > {noformat}
[jira] [Created] (SPARK-44251) Potentially incorrect results or NPE when full outer USING join has null key value
Bruce Robbins created SPARK-44251: - Summary: Potentially incorrect results or NPE when full outer USING join has null key value Key: SPARK-44251 URL: https://issues.apache.org/jira/browse/SPARK-44251 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Bruce Robbins The following query produces incorrect results: {noformat} create or replace temp view v1 as values (1, 2), (null, 7) as (c1, c2); create or replace temp view v2 as values (2, 3) as (c1, c2); select explode(array(c1)) as x from v1 full outer join v2 using (c1); -1 <== should be null 1 2 {noformat} The following query fails with a {{NullPointerException}}: {noformat} create or replace temp view v1 as values ('1', 2), (null, 7) as (c1, c2); create or replace temp view v2 as values ('2', 3) as (c1, c2); select explode(array(c1)) as x from v1 full outer join v2 using (c1); 23/06/25 17:06:39 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 11) java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.generate_doConsume_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.smj_consumeFullOuterJoinRow_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.wholestagecodegen_findNextJoinRows_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43) ... 
{noformat}
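For reference, in a full outer USING join the output key column is coalesce(left.c1, right.c1), so the unmatched null-keyed row from v1 must surface as null, never a placeholder such as -1. A pure-Python sketch of these semantics (full_outer_using is an illustrative helper, not Spark code):

```python
# Reference semantics of FULL OUTER JOIN ... USING (c1):
# - matched rows: key taken from either side (they are equal),
# - unmatched rows: key is coalesce(left_key, right_key),
# - a null key never matches anything, but the row is still emitted.
def full_outer_using(left, right):
    out, matched_right = [], set()
    for lk, lv in left:
        hits = [i for i, (rk, _) in enumerate(right) if rk is not None and rk == lk]
        if hits:
            for i in hits:
                matched_right.add(i)
                out.append((lk, lv, right[i][1]))
        else:
            out.append((lk, lv, None))  # coalesce(lk, null) = lk, possibly null
    for i, (rk, rv) in enumerate(right):
        if i not in matched_right:
            out.append((rk, None, rv))  # coalesce(null, rk) = rk
    return out

# Same rows as the repro: v1 has a null join-key value.
v1 = [(1, 2), (None, 7)]
v2 = [(2, 3)]
keys = [k for k, _, _ in full_outer_using(v1, v2)]
```

The null key survives into the output, which is what the buggy explode-over-join plan turns into -1 (or an NPE for string keys).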
[jira] [Commented] (SPARK-44132) nesting full outer joins confuses code generator
[ https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735976#comment-17735976 ] Bruce Robbins commented on SPARK-44132: --- [~steven.aerts] Go for it! > nesting full outer joins confuses code generator > > > Key: SPARK-44132 > URL: https://issues.apache.org/jira/browse/SPARK-44132 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.4.0, 3.5.0 > Environment: We verified the existence of this bug from spark 3.3 > until spark 3.5. >Reporter: Steven Aerts >Priority: Major > > We are seeing issues with the code generator when querying java bean encoded > data with 2 nested joins. > {code:java} > dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); > {code} > will generate invalid code in the code generator. And can depending on the > data used generate stack traces like: > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > {code} > Or: > {code:java} > Caused by: java.lang.AssertionError: index (2) should < 2 > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118) > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown > Source) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > {code} > When we look at the generated code we see that the code generator seems to be > mixing up parameters. For example: > {code:java} > if (smj_leftOutputRow_0 != null) { //< null > check for wrong/left parameter > boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //< causes > NPE on right parameter here{code} > It is as if the nesting of 2 full outer joins is confusing the code > generator and as such generating invalid code. > There is one other strange thing. We found this issue when using data sets > which were using the java bean encoder. We tried to reproduce this in the > spark shell or using scala case classes but were unable to do so. > We made a reproduction scenario as unit tests (one for each of the stack traces > above) on the spark code base and made it available as a [pull > request|https://github.com/apache/spark/pull/41688] to this case.
[jira] [Comment Edited] (SPARK-44132) nesting full outer joins confuses code generator
[ https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735944#comment-17735944 ] Bruce Robbins edited comment on SPARK-44132 at 6/22/23 1:51 AM: You may have this figured out already, but in case not, here's a clue. You can replicate the NPE in {{spark-shell}} as follows: {noformat} val dsA = Seq((1, 1)).toDF("id", "a") val dsB = Seq((2, 2)).toDF("id", "a") val dsC = Seq((3, 3)).toDF("id", "a") val joined = dsA.join(dsB, Stream("id"), "full_outer").join(dsC, Stream("id"), "full_outer"); joined.collectAsList {noformat} I think it's because the join column sequence {{idSeq}} (in your unit test) is provided as a {{Stream}}. {{toSeq}} in {{JavaConverters}} returns a Stream: {noformat} scala> scala.collection.JavaConverters.collectionAsScalaIterableConverter( Collections.singletonList("id") ).asScala.toSeq; | | res2: Seq[String] = Stream(id, ?) scala> {noformat} This seems to be a bug in the handling of the join columns, but only in the case where they're provided as a {{Stream}} (see similar bugs SPARK-38308, SPARK-38528, SPARK-38221, SPARK-26680). was (Author: bersprockets): You may have this figured out already, but in case not, here's a clue. You can replicate the NPE in {{spark-shell}} as follows: {noformat} val dsA = Seq((1, 1)).toDF("id", "a") val dsB = Seq((2, 2)).toDF("id", "a") val dsC = Seq((3, 3)).toDF("id", "a") val joined = dsA.join(dsB, Stream("id"), "full_outer").join(dsC, Stream("id"), "full_outer"); joined.collectAsList {noformat} I think it's because the join column sequence {{idSeq}} (in your unit test) is provided as a {{Stream}}. {{toSeq}} in {{JavaConverters}} returns a Stream: {noformat} scala> scala.collection.JavaConverters.collectionAsScalaIterableConverter( Collections.singletonList("id") ).asScala.toSeq; | | res2: Seq[String] = Stream(id, ?)
scala> {noformat} This seems to be a bug in the handling of the join columns, but only in the case where they're provided as a {{Stream}} (see similar bugs SPARK-38308, SPARK-38528, SPARK-38221). > nesting full outer joins confuses code generator > > > Key: SPARK-44132 > URL: https://issues.apache.org/jira/browse/SPARK-44132 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.4.0, 3.5.0 > Environment: We verified the existence of this bug from spark 3.3 > until spark 3.5. >Reporter: Steven Aerts >Priority: Major > > We are seeing issues with the code generator when querying java bean encoded > data with 2 nested joins. > {code:java} > dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); > {code} > will generate invalid code in the code generator. And can, depending on the > data used, generate stack traces like: > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > {code} > Or: > {code:java} > Caused by: java.lang.AssertionError: index (2) should < 2 > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118) > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown > Source)
> at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > {code} > When we look at the generated code we see that the code generator seems to be > mixing up parameters. For example: > {code:java} > if (smj_leftOutputRow_0 != null) { //< null > check for wrong/left parameter > boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //< causes > NPE on right parameter here{code} > It is as if the nesting of 2 full outer joins is confusing the code > generator and as such generating invalid code. > There is one other strange thing. We found this issue when using data sets > which were using the java bean encoder. We tried to reproduce this in the > spark
[jira] [Commented] (SPARK-44132) nesting full outer joins confuses code generator
[ https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735944#comment-17735944 ] Bruce Robbins commented on SPARK-44132: --- You may have this figured out already, but in case not, here's a clue. You can replicate the NPE in {{spark-shell}} as follows: {noformat} val dsA = Seq((1, 1)).toDF("id", "a") val dsB = Seq((2, 2)).toDF("id", "a") val dsC = Seq((3, 3)).toDF("id", "a") val joined = dsA.join(dsB, Stream("id"), "full_outer").join(dsC, Stream("id"), "full_outer"); joined.collectAsList {noformat} I think it's because the join column sequence {{idSeq}} (in your unit test) is provided as a {{Stream}}. {{toSeq}} in {{JavaConverters}} returns a Stream: {noformat} scala> scala.collection.JavaConverters.collectionAsScalaIterableConverter( Collections.singletonList("id") ).asScala.toSeq; | | res2: Seq[String] = Stream(id, ?) scala> {noformat} This seems to be a bug in the handling of the join columns, but only in the case where they're provided as a {{Stream}} (see similar bugs SPARK-38308, SPARK-38528, SPARK-38221). > nesting full outer joins confuses code generator > > > Key: SPARK-44132 > URL: https://issues.apache.org/jira/browse/SPARK-44132 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.4.0, 3.5.0 > Environment: We verified the existence of this bug from spark 3.3 > until spark 3.5. >Reporter: Steven Aerts >Priority: Major > > We are seeing issues with the code generator when querying java bean encoded > data with 2 nested joins. > {code:java} > dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); > {code} > will generate invalid code in the code generator.
And can, depending on the > data used, generate stack traces like: > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > {code} > Or: > {code:java} > Caused by: java.lang.AssertionError: index (2) should < 2 > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118) > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > {code} > When we look at the generated code we see that the code generator seems to be > mixing up parameters. For example: > {code:java} > if (smj_leftOutputRow_0 != null) { //< null > check for wrong/left parameter > boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //< causes > NPE on right parameter here{code} > It is as if the nesting of 2 full outer joins is confusing the code > generator and as such generating invalid code. > There is one other strange thing. We found this issue when using data sets > which were using the java bean encoder. We tried to reproduce this in the > spark shell or using scala case classes but were unable to do so.
> We made a reproduction scenario as unit tests (one for each of the stack traces > above) on the spark code base and made it available as a [pull > request|https://github.com/apache/spark/pull/41688] to this case.
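The clue in the comments above is that the join columns arrive as a lazily evaluated Scala {{Stream}} rather than a strict {{Seq}}. A rough plain-Python sketch of why lazy sequences are hazardous in planner-style code (this is an illustration only, not a model of Scala's memoized Stream or of Spark's actual codegen; the function name is hypothetical):

```python
# A generator is single-pass: code that traverses the "join columns" twice
# sees all of them the first time and none the second, silently producing
# mismatched left/right sides -- analogous to how lazy evaluation of the
# join-column sequence can desynchronize the two sides of the generated join.
def plan_join(join_cols):
    build_side = [c.upper() for c in join_cols]   # first traversal
    stream_side = [c.upper() for c in join_cols]  # second traversal
    return build_side, stream_side

strict = ["id"]                 # like Seq("id")
lazy = (c for c in ["id"])      # like a lazily evaluated sequence

assert plan_join(strict) == (["ID"], ["ID"])  # both sides agree on the key
assert plan_join(lazy) == (["ID"], [])        # second traversal sees nothing
```

The strict list is safe to traverse repeatedly, which is why materializing the columns (e.g. {{idSeq.toList}}) is the usual defensive workaround in the similar bugs cited above.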
[jira] [Commented] (SPARK-44040) Incorrect result after count distinct
[ https://issues.apache.org/jira/browse/SPARK-44040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17732163#comment-17732163 ] Bruce Robbins commented on SPARK-44040: --- It seems this can be reproduced in {{spark-sql}} as well. Interestingly, turning off AQE seems to fix the issue (for both the above dataframe version and the below SQL version): {noformat} spark-sql (default)> create or replace temp view v1 as select 1 as c1 limit 0; Time taken: 0.959 seconds spark-sql (default)> create or replace temp view agg1 as select sum(c1) as c1, "agg1" as name from v1; Time taken: 0.16 seconds spark-sql (default)> create or replace temp view agg2 as select sum(c1) as c1, "agg2" as name from v1; Time taken: 0.035 seconds spark-sql (default)> create or replace temp view union1 as select * from agg1 union select * from agg2; Time taken: 0.088 seconds spark-sql (default)> -- the following incorrectly produces 2 rows select distinct c1 from union1; NULL NULL Time taken: 1.649 seconds, Fetched 2 row(s) spark-sql (default)> set spark.sql.adaptive.enabled=false; spark.sql.adaptive.enabled false Time taken: 0.019 seconds, Fetched 1 row(s) spark-sql (default)> -- the following correctly produces 1 row select distinct c1 from union1; NULL Time taken: 1.372 seconds, Fetched 1 row(s) spark-sql (default)> {noformat} > Incorrect result after count distinct > - > > Key: SPARK-44040 > URL: https://issues.apache.org/jira/browse/SPARK-44040 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Aleksandr Aleksandrov >Priority: Critical > > When i try to call count after distinct function for Decimal null field, > spark return incorrect result starting from spark 3.4.0. 
> A minimal example to reproduce: > import org.apache.spark.sql.types._ > import org.apache.spark.sql.\{Column, DataFrame, Dataset, Row, SparkSession} > import org.apache.spark.sql.types.\{StringType, StructField, StructType} > val schema = StructType( Array( > StructField("money", DecimalType(38,6), true), > StructField("reference_id", StringType, true) > )) > val payDf = spark.createDataFrame(sc.emptyRDD[Row], schema) > val aggDf = payDf.agg(sum("money").as("money")).withColumn("name", lit("df1")) > val aggDf1 = payDf.agg(sum("money").as("money")).withColumn("name", > lit("df2")) > val unionDF: DataFrame = aggDf.union(aggDf1) > unionDF.select("money").distinct.show // return correct result > unionDF.select("money").distinct.count // return 2 instead of 1 > unionDF.select("money").distinct.count == 1 // return false > This block of code returns some assertion error and after that an incorrect > count (in spark 3.2.1 everything works fine and i get correct result = 1): > *scala> unionDF.select("money").distinct.show // return correct result* > java.lang.AssertionError: assertion failed: > Decimal$DecimalIsFractional > while compiling: > during phase: globalPhase=terminal, enteringPhase=jvm > library version: version 2.12.17 > compiler version: version 2.12.17 > reconstructed args: -classpath > /Users/aleksandrov/.ivy2/jars/org.apache.spark_spark-connect_2.12-3.4.0.jar:/Users/aleksandrov/.ivy2/jars/io.delta_delta-core_2.12-2.4.0.jar:/Users/aleksandrov/.ivy2/jars/io.delta_delta-storage-2.4.0.jar:/Users/aleksandrov/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar:/Users/aleksandrov/.ivy2/jars/org.antlr_antlr4-runtime-4.9.3.jar > -Yrepl-class-based -Yrepl-outdir > /private/var/folders/qj/_dn4xbp14jn37qmdk7ylyfwcgr/T/spark-f37bb154-75f3-4db7-aea8-3c4363377bd8/repl-350f37a1-1df1-4816-bd62-97929c60a6c1 > last tree to typer: TypeTree(class Byte) > tree position: line 6 of > tree tpe: Byte > symbol: (final abstract) class Byte in package scala > symbol definition: final 
abstract class Byte extends (a ClassSymbol) > symbol package: scala > symbol owners: class Byte > call site: constructor $eval in object $eval in package $line19 > == Source file context for tree position == > 3 > 4object $eval { > 5lazyval $result = > $line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.res0 > 6lazyval $print: {_}root{_}.java.lang.String = { > 7 $line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw > 8 > 9"" > at > scala.reflect.internal.SymbolTable.throwAssertionError(SymbolTable.scala:185) > at scala.reflect.internal.Symbols$Symbol.completeInfo(Symbols.scala:1525) > at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1514) > at scala.reflect.internal.Symbols$Symbol.flatOwnerInfo(Symbols.scala:2353) > at > scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:3346) > at > scala.reflect.internal.Symbols$ClassSymbol.companionModule(Symbols.scala:3348) > at > scala.reflect.internal.Symbols$ModuleClassSymbol.sourceModule(Symbols.scala:3487) > at >
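The DISTINCT semantics that SPARK-44040 violates can be stated without Spark at all: UNION deduplicates whole rows, and DISTINCT must treat NULLs in a column as equal, so the two NULL sums should collapse to one row. A minimal plain-Python sketch (Python {{None}} stands in for SQL NULL here; this is an illustration of the expected result, not of Spark internals):

```python
# Each aggregate over zero rows yields a NULL sum, tagged with a name column.
rows_agg1 = [(None, "agg1")]
rows_agg2 = [(None, "agg2")]

union1 = set(rows_agg1) | set(rows_agg2)      # UNION deduplicates whole rows
distinct_c1 = {c1 for (c1, _name) in union1}  # SELECT DISTINCT c1

assert len(union1) == 2        # rows differ in the name column
assert distinct_c1 == {None}   # but c1 collapses to a single NULL
assert len(distinct_c1) == 1   # the buggy plan returned 2 rows here
```

This matches the {{spark-sql}} session above: with AQE disabled the query correctly returns one NULL row, so the deduplication step itself, not the data, is where the AQE plan goes wrong.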
[jira] [Resolved] (SPARK-43843) Saving an AVRO file with Scala 2.13 results in NoClassDefFoundError
[ https://issues.apache.org/jira/browse/SPARK-43843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins resolved SPARK-43843. --- Resolution: Invalid > Saving an AVRO file with Scala 2.13 results in NoClassDefFoundError > --- > > Key: SPARK-43843 > URL: https://issues.apache.org/jira/browse/SPARK-43843 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 > Environment: Scala version 2.13.8 (Java HotSpot(TM) 64-Bit Server VM, > Java 11.0.12) >Reporter: Bruce Robbins >Priority: Major > > I launched spark-shell as so: > {noformat} > bin/spark-shell --driver-memory 8g --jars `find . -name "spark-avro*.jar" | > grep -v test | head -1` > {noformat} > I got the below error trying to create an AVRO file: > {noformat} > scala> val df = Seq((1, 2), (3, 4)).toDF("a", "b") > val df = Seq((1, 2), (3, 4)).toDF("a", "b") > val df: org.apache.spark.sql.DataFrame = [a: int, b: int] > scala> df.write.mode("overwrite").format("avro").save("avro_file") > df.write.mode("overwrite").format("avro").save("avro_file") > java.lang.NoClassDefFoundError: scala/collection/immutable/StringOps > at > org.apache.spark.sql.avro.AvroFileFormat.supportFieldName(AvroFileFormat.scala:160) > at > org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1(DataSourceUtils.scala:75) > at > org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1$adapted(DataSourceUtils.scala:74) > at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563) > at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:105) > at > org.apache.spark.sql.execution.datasources.DataSourceUtils$.checkFieldNames(DataSourceUtils.scala:74) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:120) > ... 
> scala> > {noformat}
[jira] [Commented] (SPARK-43843) Saving an AVRO file with Scala 2.13 results in NoClassDefFoundError
[ https://issues.apache.org/jira/browse/SPARK-43843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17726988#comment-17726988 ] Bruce Robbins commented on SPARK-43843: --- Nevermind, I had an old {{spark-avro_2.12-3.5.0-SNAPSHOT.jar}} laying about in my {{work}} directory which the find in my {{--jars}} value found first. > Saving an AVRO file with Scala 2.13 results in NoClassDefFoundError > --- > > Key: SPARK-43843 > URL: https://issues.apache.org/jira/browse/SPARK-43843 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 > Environment: Scala version 2.13.8 (Java HotSpot(TM) 64-Bit Server VM, > Java 11.0.12) >Reporter: Bruce Robbins >Priority: Major > > I launched spark-shell as so: > {noformat} > bin/spark-shell --driver-memory 8g --jars `find . -name "spark-avro*.jar" | > grep -v test | head -1` > {noformat} > I got the below error trying to create an AVRO file: > {noformat} > scala> val df = Seq((1, 2), (3, 4)).toDF("a", "b") > val df = Seq((1, 2), (3, 4)).toDF("a", "b") > val df: org.apache.spark.sql.DataFrame = [a: int, b: int] > scala> df.write.mode("overwrite").format("avro").save("avro_file") > df.write.mode("overwrite").format("avro").save("avro_file") > java.lang.NoClassDefFoundError: scala/collection/immutable/StringOps > at > org.apache.spark.sql.avro.AvroFileFormat.supportFieldName(AvroFileFormat.scala:160) > at > org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1(DataSourceUtils.scala:75) > at > org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1$adapted(DataSourceUtils.scala:74) > at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563) > at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:105) > at > org.apache.spark.sql.execution.datasources.DataSourceUtils$.checkFieldNames(DataSourceUtils.scala:74) > at > 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:120) > ... > scala> > {noformat}
[jira] [Commented] (SPARK-43841) Non-existent column in projection of full outer join with USING results in StringIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-43841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17726980#comment-17726980 ] Bruce Robbins commented on SPARK-43841: --- PR at https://github.com/apache/spark/pull/41353 > Non-existent column in projection of full outer join with USING results in > StringIndexOutOfBoundsException > -- > > Key: SPARK-43841 > URL: https://issues.apache.org/jira/browse/SPARK-43841 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Bruce Robbins >Priority: Minor > > The following query throws a {{StringIndexOutOfBoundsException}}: > {noformat} > with v1 as ( > select * from values (1, 2) as (c1, c2) > ), > v2 as ( > select * from values (2, 3) as (c1, c2) > ) > select v1.c1, v1.c2, v2.c1, v2.c2, b > from v1 > full outer join v2 > using (c1); > {noformat} > The query should fail anyway, since {{b}} refers to a non-existent column. > But it should fail with a helpful error message, not with a > {{StringIndexOutOfBoundsException}}. > The issue seems to be in > {{StringUtils#orderSuggestedIdentifiersBySimilarity}}. > {{orderSuggestedIdentifiersBySimilarity}} assumes that a list of candidate > attributes with a mix of prefixes will never have an attribute name with an > empty prefix. 
But in this case it does ({{c1}} from the {{coalesce}} has no > prefix, since it is not associated with any relation or subquery): > {noformat} > +- 'Project [c1#5, c2#6, c1#7, c2#8, 'b] >+- Project [coalesce(c1#5, c1#7) AS c1#9, c2#6, c2#8] <== c1#9 has no > prefix, unlike c2#6 (v1.c2) or c2#8 (v2.c2) > +- Join FullOuter, (c1#5 = c1#7) > :- SubqueryAlias v1 > : +- CTERelationRef 0, true, [c1#5, c2#6] > +- SubqueryAlias v2 > +- CTERelationRef 1, true, [c1#7, c2#8] > {noformat} > Because of this, {{orderSuggestedIdentifiersBySimilarity}} returns a sorted > list of suggestions like this: > {noformat} > ArrayBuffer(.c1, v1.c2, v2.c2) > {noformat} > {{UnresolvedAttribute.parseAttributeName}} chokes on an attribute name that > starts with a namespace separator ('.').
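The failure mode described above can be sketched in a few lines of plain Python. The helper names below are hypothetical and the logic is only illustrative of the empty-prefix edge case, not Spark's actual {{orderSuggestedIdentifiersBySimilarity}} implementation:

```python
# Candidate attributes as (qualifier, name) pairs; an attribute not tied to
# any relation or subquery has an empty qualifier.
def render_suggestions(candidates):
    # Naively joining qualifier and name with '.' reproduces the bug:
    # an empty qualifier yields a leading separator.
    return sorted(f"{qualifier}.{name}" for qualifier, name in candidates)

suggestions = render_suggestions([("", "c1"), ("v1", "c2"), ("v2", "c2")])
assert suggestions == [".c1", "v1.c2", "v2.c2"]  # ".c1" is the bad suggestion

# A defensive fix: omit the separator when there is no qualifier.
def render_fixed(candidates):
    return sorted(f"{q}.{n}" if q else n for q, n in candidates)

assert render_fixed([("", "c1"), ("v1", "c2"), ("v2", "c2")]) == ["c1", "v1.c2", "v2.c2"]
```

The suggestion list itself is only meant for the error message, which is why the symptom is a {{StringIndexOutOfBoundsException}} from the name parser rather than a wrong query result.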
[jira] [Updated] (SPARK-43843) Saving an AVRO file with Scala 2.13 results in NoClassDefFoundError
[ https://issues.apache.org/jira/browse/SPARK-43843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-43843: -- Environment: Scala version 2.13.8 (Java HotSpot(TM) 64-Bit Server VM, Java 11.0.12) > Saving an AVRO file with Scala 2.13 results in NoClassDefFoundError > --- > > Key: SPARK-43843 > URL: https://issues.apache.org/jira/browse/SPARK-43843 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 > Environment: Scala version 2.13.8 (Java HotSpot(TM) 64-Bit Server VM, > Java 11.0.12) >Reporter: Bruce Robbins >Priority: Major > > I launched spark-shell as so: > {noformat} > bin/spark-shell --driver-memory 8g --jars `find . -name "spark-avro*.jar" | > grep -v test | head -1` > {noformat} > I got the below error trying to create an AVRO file: > {noformat} > scala> val df = Seq((1, 2), (3, 4)).toDF("a", "b") > val df = Seq((1, 2), (3, 4)).toDF("a", "b") > val df: org.apache.spark.sql.DataFrame = [a: int, b: int] > scala> df.write.mode("overwrite").format("avro").save("avro_file") > df.write.mode("overwrite").format("avro").save("avro_file") > java.lang.NoClassDefFoundError: scala/collection/immutable/StringOps > at > org.apache.spark.sql.avro.AvroFileFormat.supportFieldName(AvroFileFormat.scala:160) > at > org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1(DataSourceUtils.scala:75) > at > org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1$adapted(DataSourceUtils.scala:74) > at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563) > at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:105) > at > org.apache.spark.sql.execution.datasources.DataSourceUtils$.checkFieldNames(DataSourceUtils.scala:74) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:120) > ... 
> scala> > {noformat}
[jira] [Created] (SPARK-43843) Saving an AVRO file with Scala 2.13 results in NoClassDefFoundError
Bruce Robbins created SPARK-43843: - Summary: Saving an AVRO file with Scala 2.13 results in NoClassDefFoundError Key: SPARK-43843 URL: https://issues.apache.org/jira/browse/SPARK-43843 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Bruce Robbins I launched spark-shell as so: {noformat} bin/spark-shell --driver-memory 8g --jars `find . -name "spark-avro*.jar" | grep -v test | head -1` {noformat} I got the below error trying to create an AVRO file: {noformat} scala> val df = Seq((1, 2), (3, 4)).toDF("a", "b") val df = Seq((1, 2), (3, 4)).toDF("a", "b") val df: org.apache.spark.sql.DataFrame = [a: int, b: int] scala> df.write.mode("overwrite").format("avro").save("avro_file") df.write.mode("overwrite").format("avro").save("avro_file") java.lang.NoClassDefFoundError: scala/collection/immutable/StringOps at org.apache.spark.sql.avro.AvroFileFormat.supportFieldName(AvroFileFormat.scala:160) at org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1(DataSourceUtils.scala:75) at org.apache.spark.sql.execution.datasources.DataSourceUtils$.$anonfun$checkFieldNames$1$adapted(DataSourceUtils.scala:74) at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563) at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561) at org.apache.spark.sql.types.StructType.foreach(StructType.scala:105) at org.apache.spark.sql.execution.datasources.DataSourceUtils$.checkFieldNames(DataSourceUtils.scala:74) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:120) ... scala> {noformat}
[jira] [Created] (SPARK-43841) Non-existent column in projection of full outer join with USING results in StringIndexOutOfBoundsException
Bruce Robbins created SPARK-43841: - Summary: Non-existent column in projection of full outer join with USING results in StringIndexOutOfBoundsException Key: SPARK-43841 URL: https://issues.apache.org/jira/browse/SPARK-43841 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Bruce Robbins The following query throws a {{StringIndexOutOfBoundsException}}: {noformat} with v1 as ( select * from values (1, 2) as (c1, c2) ), v2 as ( select * from values (2, 3) as (c1, c2) ) select v1.c1, v1.c2, v2.c1, v2.c2, b from v1 full outer join v2 using (c1); {noformat} The query should fail anyway, since {{b}} refers to a non-existent column. But it should fail with a helpful error message, not with a {{StringIndexOutOfBoundsException}}. The issue seems to be in {{StringUtils#orderSuggestedIdentifiersBySimilarity}}. {{orderSuggestedIdentifiersBySimilarity}} assumes that a list of candidate attributes with a mix of prefixes will never have an attribute name with an empty prefix. But in this case it does ({{c1}} from the {{coalesce}} has no prefix, since it is not associated with any relation or subquery): {noformat} +- 'Project [c1#5, c2#6, c1#7, c2#8, 'b] +- Project [coalesce(c1#5, c1#7) AS c1#9, c2#6, c2#8] <== c1#9 has no prefix, unlike c2#6 (v1.c2) or c2#8 (v2.c2) +- Join FullOuter, (c1#5 = c1#7) :- SubqueryAlias v1 : +- CTERelationRef 0, true, [c1#5, c2#6] +- SubqueryAlias v2 +- CTERelationRef 1, true, [c1#7, c2#8] {noformat} Because of this, {{orderSuggestedIdentifiersBySimilarity}} returns a sorted list of suggestions like this: {noformat} ArrayBuffer(.c1, v1.c2, v2.c2) {noformat} {{UnresolvedAttribute.parseAttributeName}} chokes on an attribute name that starts with a namespace separator ('.').
[jira] [Commented] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725143#comment-17725143 ] Bruce Robbins commented on SPARK-43718: --- PR here: https://github.com/apache/spark/pull/41267 > References to a specific side's key in a USING join can have wrong nullability > -- > > Key: SPARK-43718 > URL: https://issues.apache.org/jira/browse/SPARK-43718 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness > > Assume this data: > {noformat} > create or replace temp view t1 as values (1), (2), (3) as (c1); > create or replace temp view t2 as values (2), (3), (4) as (c1); > {noformat} > The following query produces incorrect results: > {noformat} > spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 > from t1 > full outer join t2 > using (c1); > 1 > -1 <== should be null > 2 > 2 > 3 > 3 > -1 <== should be null > 4 > Time taken: 0.663 seconds, Fetched 8 row(s) > spark-sql (default)> > {noformat} > Similar issues occur with right outer join and left outer join. > {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is > resolved, so the array's {{containsNull}} value is incorrect. > Queries that don't use arrays also can get wrong results. 
Assume this data: > {noformat} > create or replace temp view t1 as values (0), (1), (2) as (c1); > create or replace temp view t2 as values (1), (2), (3) as (c1); > create or replace temp view t3 as values (1, 2), (3, 4), (4, 5) as (a, b); > {noformat} > The following query produces incorrect results: > {noformat} > select t1.c1 as t1_c1, t2.c1 as t2_c1, b > from t1 > full outer join t2 > using (c1), > lateral ( > select b > from t3 > where a = coalesce(t2.c1, 1) > ) lt3; > 1 1 2 > NULL 3 4 > Time taken: 2.395 seconds, Fetched 2 row(s) > spark-sql (default)> > {noformat} > The result should be the following: > {noformat} > 0 NULL2 > 1 1 2 > NULL 3 4 > {noformat}
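The nullability rule the issue describes is easy to verify outside Spark: in a full outer join with USING, each unmatched row leaves exactly one side's key NULL, so references to a specific side's key must be treated as nullable. A minimal plain-Python sketch of the first repro (single-column tables as plain lists; the helper is illustrative, not Spark's join implementation):

```python
# Full outer join of two key lists with USING semantics: matched keys pair up,
# unmatched keys appear with NULL (None) on the other side.
def full_outer_using(left, right):
    rows = [(l, l if l in right else None) for l in left]  # left + matches
    rows += [(None, r) for r in right if r not in left]    # right-only rows
    return rows

t1, t2 = [1, 2, 3], [2, 3, 4]
joined = full_outer_using(t1, t2)

assert (1, None) in joined            # t2.c1 is NULL here, not -1
assert (None, 4) in joined            # t1.c1 is NULL here, not -1
assert (2, 2) in joined and (3, 3) in joined
```

The bogus -1 values in the {{explode(array(t1.c1, t2.c1))}} output arise because the array's {{containsNull}} is derived from the keys' (wrong) non-nullable flag, so the NULLs are read back as garbage instead of NULL.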
[jira] [Updated] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-43718: -- Description: Assume this data: {noformat} create or replace temp view t1 as values (1), (2), (3) as (c1); create or replace temp view t2 as values (2), (3), (4) as (c1); {noformat} The following query produces incorrect results: {noformat} spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 from t1 full outer join t2 using (c1); 1 -1 <== should be null 2 2 3 3 -1 <== should be null 4 Time taken: 0.663 seconds, Fetched 8 row(s) spark-sql (default)> {noformat} Similar issues occur with right outer join and left outer join. {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is resolved, so the array's {{containsNull}} value is incorrect. Queries that don't use arrays also can get wrong results. Assume this data: {noformat} create or replace temp view t1 as values (0), (1), (2) as (c1); create or replace temp view t2 as values (1), (2), (3) as (c1); create or replace temp view t3 as values (1, 2), (3, 4), (4, 5) as (a, b); {noformat} The following query produces incorrect results: {noformat} select t1.c1 as t1_c1, t2.c1 as t2_c1, b from t1 full outer join t2 using (c1), lateral ( select b from t3 where a = coalesce(t2.c1, 1) ) lt3; 1 1 2 NULL3 4 Time taken: 2.395 seconds, Fetched 2 row(s) spark-sql (default)> {noformat} The result should be the following: {noformat} 0 NULL2 1 1 2 NULL3 4 {noformat} was: Assume this data: {noformat} create or replace temp view t1 as values (1), (2), (3) as (c1); create or replace temp view t2 as values (2), (3), (4) as (c1); {noformat} The following query produces incorrect results: {noformat} spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 from t1 full outer join t2 using (c1); 1 -1 <== should be null 2 2 3 3 -1 <== should be null 4 Time taken: 0.663 seconds, Fetched 8 row(s) spark-sql (default)> {noformat} Similar issues occur with 
right outer join and left outer join. {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is resolved, so the array's {{containsNull}} value is incorrect. > References to a specific side's key in a USING join can have wrong nullability > -- > > Key: SPARK-43718 > URL: https://issues.apache.org/jira/browse/SPARK-43718 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness > > Assume this data: > {noformat} > create or replace temp view t1 as values (1), (2), (3) as (c1); > create or replace temp view t2 as values (2), (3), (4) as (c1); > {noformat} > The following query produces incorrect results: > {noformat} > spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 > from t1 > full outer join t2 > using (c1); > 1 > -1 <== should be null > 2 > 2 > 3 > 3 > -1 <== should be null > 4 > Time taken: 0.663 seconds, Fetched 8 row(s) > spark-sql (default)> > {noformat} > Similar issues occur with right outer join and left outer join. > {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is > resolved, so the array's {{containsNull}} value is incorrect. > Queries that don't use arrays can also get wrong results.
Assume this data: > {noformat} > create or replace temp view t1 as values (0), (1), (2) as (c1); > create or replace temp view t2 as values (1), (2), (3) as (c1); > create or replace temp view t3 as values (1, 2), (3, 4), (4, 5) as (a, b); > {noformat} > The following query produces incorrect results: > {noformat} > select t1.c1 as t1_c1, t2.c1 as t2_c1, b > from t1 > full outer join t2 > using (c1), > lateral ( > select b > from t3 > where a = coalesce(t2.c1, 1) > ) lt3; > 1 1 2 > NULL 3 4 > Time taken: 2.395 seconds, Fetched 2 row(s) > spark-sql (default)> > {noformat} > The result should be the following: > {noformat} > 0 NULL 2 > 1 1 2 > NULL 3 4 > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
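The expected semantics of the first query can be checked without Spark. The plain-Python sketch below is illustrative only (the helper `full_outer_join_using` is a hypothetical name, not Spark code); it reproduces the output the report says the query should produce, with None standing in for SQL NULL:

```python
# Plain-Python sketch of FULL OUTER JOIN ... USING (c1) semantics:
# references to t1.c1 and t2.c1 must be nullable, because unmatched
# rows yield NULL on the missing side.
t1 = [1, 2, 3]
t2 = [2, 3, 4]

def full_outer_join_using(left, right):
    """Yield (t1.c1, t2.c1) pairs for a full outer join on c1."""
    rows = []
    for v in left:
        rows.append((v, v if v in right else None))
    for v in right:
        if v not in left:
            rows.append((None, v))
    return rows

# explode(array(t1.c1, t2.c1)): one output row per array element.
exploded = [x for l, r in full_outer_join_using(t1, t2) for x in (l, r)]
print(exploded)  # [1, None, 2, 2, 3, 3, None, 4]
```

This matches the "should be null" annotations in the report: the second and seventh rows are NULL, not -1.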
[jira] [Updated] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-43718: -- Affects Version/s: 3.3.2 > References to a specific side's key in a USING join can have wrong nullability > -- > > Key: SPARK-43718 > URL: https://issues.apache.org/jira/browse/SPARK-43718 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness > > Assume this data: > {noformat} > create or replace temp view t1 as values (1), (2), (3) as (c1); > create or replace temp view t2 as values (2), (3), (4) as (c1); > {noformat} > The following query produces incorrect results: > {noformat} > spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 > from t1 > full outer join t2 > using (c1); > 1 > -1 <== should be null > 2 > 2 > 3 > 3 > -1 <== should be null > 4 > Time taken: 0.663 seconds, Fetched 8 row(s) > spark-sql (default)> > {noformat} > Similar issues occur with right outer join and left outer join. > {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is > resolved, so the array's {{containsNull}} value is incorrect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-43718: -- Affects Version/s: 3.4.0 > References to a specific side's key in a USING join can have wrong nullability > -- > > Key: SPARK-43718 > URL: https://issues.apache.org/jira/browse/SPARK-43718 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness > > Assume this data: > {noformat} > create or replace temp view t1 as values (1), (2), (3) as (c1); > create or replace temp view t2 as values (2), (3), (4) as (c1); > {noformat} > The following query produces incorrect results: > {noformat} > spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 > from t1 > full outer join t2 > using (c1); > 1 > -1 <== should be null > 2 > 2 > 3 > 3 > -1 <== should be null > 4 > Time taken: 0.663 seconds, Fetched 8 row(s) > spark-sql (default)> > {noformat} > Similar issues occur with right outer join and left outer join. > {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is > resolved, so the array's {{containsNull}} value is incorrect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725122#comment-17725122 ] Bruce Robbins commented on SPARK-43718: --- I think I have a handle on this. I will submit a PR in the coming days. > References to a specific side's key in a USING join can have wrong nullability > -- > > Key: SPARK-43718 > URL: https://issues.apache.org/jira/browse/SPARK-43718 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness > > Assume this data: > {noformat} > create or replace temp view t1 as values (1), (2), (3) as (c1); > create or replace temp view t2 as values (2), (3), (4) as (c1); > {noformat} > The following query produces incorrect results: > {noformat} > spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 > from t1 > full outer join t2 > using (c1); > 1 > -1 <== should be null > 2 > 2 > 3 > 3 > -1 <== should be null > 4 > Time taken: 0.663 seconds, Fetched 8 row(s) > spark-sql (default)> > {noformat} > Similar issues occur with right outer join and left outer join. > {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is > resolved, so the array's {{containsNull}} value is incorrect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-43718: -- Labels: correctness (was: ) > References to a specific side's key in a USING join can have wrong nullability > -- > > Key: SPARK-43718 > URL: https://issues.apache.org/jira/browse/SPARK-43718 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness > > Assume this data: > {noformat} > create or replace temp view t1 as values (1), (2), (3) as (c1); > create or replace temp view t2 as values (2), (3), (4) as (c1); > {noformat} > The following query produces incorrect results: > {noformat} > spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 > from t1 > full outer join t2 > using (c1); > 1 > -1 <== should be null > 2 > 2 > 3 > 3 > -1 <== should be null > 4 > Time taken: 0.663 seconds, Fetched 8 row(s) > spark-sql (default)> > {noformat} > Similar issues occur with right outer join and left outer join. > {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is > resolved, so the array's {{containsNull}} value is incorrect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-43718: -- Description: Assume this data: {noformat} create or replace temp view t1 as values (1), (2), (3) as (c1); create or replace temp view t2 as values (2), (3), (4) as (c1); {noformat} The following query produces incorrect results: {noformat} spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 from t1 full outer join t2 using (c1); 1 -1 <== should be null 2 2 3 3 -1 <== should be null 4 Time taken: 0.663 seconds, Fetched 8 row(s) spark-sql (default)> {noformat} Similar issues occur with right outer join and left outer join. {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is resolved, so the array's {{containsNull}} value is incorrect. was: Assume this data: {noformat} create or replace temp view t1 as values (1), (2), (3) as (c1); create or replace temp view t2 as values (2), (3), (4) as (c1); {noformat} The following query produces the wrong result: {noformat} spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 from t1 full outer join t2 using (c1); 1 -1 <== should be null 2 2 3 3 -1 <== should be null 4 Time taken: 0.663 seconds, Fetched 8 row(s) spark-sql (default)> {noformat} Similar issues occur with right outer join and left outer join. {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is resolved, so the array's {{containsNull}} value is incorrect. 
> References to a specific side's key in a USING join can have wrong nullability > -- > > Key: SPARK-43718 > URL: https://issues.apache.org/jira/browse/SPARK-43718 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > Assume this data: > {noformat} > create or replace temp view t1 as values (1), (2), (3) as (c1); > create or replace temp view t2 as values (2), (3), (4) as (c1); > {noformat} > The following query produces incorrect results: > {noformat} > spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 > from t1 > full outer join t2 > using (c1); > 1 > -1 <== should be null > 2 > 2 > 3 > 3 > -1 <== should be null > 4 > Time taken: 0.663 seconds, Fetched 8 row(s) > spark-sql (default)> > {noformat} > Similar issues occur with right outer join and left outer join. > {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is > resolved, so the array's {{containsNull}} value is incorrect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability
[ https://issues.apache.org/jira/browse/SPARK-43718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-43718: -- Description: Assume this data: {noformat} create or replace temp view t1 as values (1), (2), (3) as (c1); create or replace temp view t2 as values (2), (3), (4) as (c1); {noformat} The following query produces the wrong result: {noformat} spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 from t1 full outer join t2 using (c1); 1 -1 <== should be null 2 2 3 3 -1 <== should be null 4 Time taken: 0.663 seconds, Fetched 8 row(s) spark-sql (default)> {noformat} Similar issues occur with right outer join and left outer join. {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is resolved, so the array's {{containsNull}} value is incorrect. was: Assume this data: {noformat} create or replace temp view t1 as values (1), (2), (3) as (c1); create or replace temp view t2 as values (2), (3), (4) as (c1); {noformat} The following query produces the wrong result: {noformat} spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 from t1 full outer join t2 using (c1); 1 -1 <== should be null 2 2 3 3 -1 <== should be null 4 Time taken: 0.663 seconds, Fetched 8 row(s) spark-sql (default)> {noformat} Similar issues occur with right outer join and left outer join. 
> References to a specific side's key in a USING join can have wrong nullability > -- > > Key: SPARK-43718 > URL: https://issues.apache.org/jira/browse/SPARK-43718 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > Assume this data: > {noformat} > create or replace temp view t1 as values (1), (2), (3) as (c1); > create or replace temp view t2 as values (2), (3), (4) as (c1); > {noformat} > The following query produces the wrong result: > {noformat} > spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 > from t1 > full outer join t2 > using (c1); > 1 > -1 <== should be null > 2 > 2 > 3 > 3 > -1 <== should be null > 4 > Time taken: 0.663 seconds, Fetched 8 row(s) > spark-sql (default)> > {noformat} > Similar issues occur with right outer join and left outer join. > {{t1.c1}} and {{t2.c1}} have the wrong nullability at the time the array is > resolved, so the array's {{containsNull}} value is incorrect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43718) References to a specific side's key in a USING join can have wrong nullability
Bruce Robbins created SPARK-43718: - Summary: References to a specific side's key in a USING join can have wrong nullability Key: SPARK-43718 URL: https://issues.apache.org/jira/browse/SPARK-43718 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Bruce Robbins Assume this data: {noformat} create or replace temp view t1 as values (1), (2), (3) as (c1); create or replace temp view t2 as values (2), (3), (4) as (c1); {noformat} The following query produces the wrong result: {noformat} spark-sql (default)> select explode(array(t1.c1, t2.c1)) as x1 from t1 full outer join t2 using (c1); 1 -1 <== should be null 2 2 3 3 -1 <== should be null 4 Time taken: 0.663 seconds, Fetched 8 row(s) spark-sql (default)> {noformat} Similar issues occur with right outer join and left outer join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
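The mechanism behind this bug (per the issue's own explanation: the array's {{containsNull}} is derived while {{t1.c1}} and {{t2.c1}} still carry pre-join nullability) can be sketched with a toy type-checker. This is illustrative only; `Col` and `array_type_contains_null` are hypothetical names, not Spark's analyzer API:

```python
# Toy sketch: an array type's containsNull must be derived from element
# nullability AFTER outer-join nullability has been applied to the columns.
from dataclasses import dataclass

@dataclass
class Col:
    name: str
    nullable: bool

def array_type_contains_null(elements):
    # containsNull is true if any element expression is nullable
    return any(e.nullable for e in elements)

# Before the full outer join is accounted for, both keys look non-nullable,
# so the derived array type wrongly claims it can hold no nulls:
t1_c1, t2_c1 = Col("t1.c1", False), Col("t2.c1", False)
print(array_type_contains_null([t1_c1, t2_c1]))  # False -- wrong for this query

# Once outer-join nullability is applied, the derived type is correct:
t1_c1.nullable = t2_c1.nullable = True
print(array_type_contains_null([t1_c1, t2_c1]))  # True
```

With {{containsNull}} wrongly false, a NULL element is read back as garbage (the -1 in the report) rather than as NULL.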
[jira] [Created] (SPARK-43149) When CREATE USING fails to store metadata in metastore, data gets left around
Bruce Robbins created SPARK-43149: - Summary: When CREATE USING fails to store metadata in metastore, data gets left around Key: SPARK-43149 URL: https://issues.apache.org/jira/browse/SPARK-43149 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Bruce Robbins For example: {noformat} drop table if exists parquet_ds1; -- try creating table with invalid column name -- use 'using parquet' to designate the data source create table parquet_ds1 using parquet as select id, date'2018-01-01' + make_dt_interval(0, id) from range(0, 10); Cannot create a table having a column whose name contains commas in Hive metastore. Table: `spark_catalog`.`default`.`parquet_ds1`; Column: DATE '2018-01-01' + make_dt_interval(0, id, 0, 0.00) -- show that table did not get created show tables; -- try again with valid column name -- spark will complain that directory already exists create table parquet_ds1 using parquet as select id, date'2018-01-01' + make_dt_interval(0, id) as ts from range(0, 10); [LOCATION_ALREADY_EXISTS] Cannot name the managed table as `spark_catalog`.`default`.`parquet_ds1`, as its associated location 'file:/Users/bruce/github/spark_upstream/spark-warehouse/parquet_ds1' already exists. Please pick a different table name, or remove the existing location first. org.apache.spark.SparkRuntimeException: [LOCATION_ALREADY_EXISTS] Cannot name the managed table as `spark_catalog`.`default`.`parquet_ds1`, as its associated location 'file:/Users/bruce/github/spark_upstream/spark-warehouse/parquet_ds1' already exists. Please pick a different table name, or remove the existing location first. at org.apache.spark.sql.errors.QueryExecutionErrors$.locationAlreadyExists(QueryExecutionErrors.scala:2804) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateTableLocation(SessionCatalog.scala:414) at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176) ... 
{noformat} One must manually remove the directory {{spark-warehouse/parquet_ds1}} before the {{create table}} command will succeed. It seems that datasource table creation runs the data-creation job first, then stores the metadata into the metastore. When using Spark to create Hive tables, the issue does not happen: {noformat} drop table if exists parquet_hive1; -- try creating table with invalid column name, -- but use 'stored as parquet' instead of 'using' create table parquet_hive1 stored as parquet as select id, date'2018-01-01' + make_dt_interval(0, id) from range(0, 10); Cannot create a table having a column whose name contains commas in Hive metastore. Table: `spark_catalog`.`default`.`parquet_hive1`; Column: DATE '2018-01-01' + make_dt_interval(0, id, 0, 0.00) -- try again with valid column name. This will succeed; create table parquet_hive1 stored as parquet as select id, date'2018-01-01' + make_dt_interval(0, id) as ts from range(0, 10); {noformat} It seems that Hive table creation stores metadata into the metastore first, then runs the data-creation job. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
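The ordering difference described above can be modeled in a few lines. This is a simplified sketch, not Spark's actual CTAS code path; the function names and the comma-validation rule stand in for the real metastore check:

```python
# Sketch of the two CTAS orderings described in the report:
# write-then-register leaves data behind when the metastore step fails;
# register-first (validate before writing) leaves nothing behind.
import os
import shutil
import tempfile

class MetastoreError(Exception):
    pass

def validate_columns(columns):
    # stand-in for the Hive metastore's rejection of commas in column names
    for c in columns:
        if "," in c:
            raise MetastoreError(f"column name contains commas: {c}")

def ctas_write_then_register(path, columns):
    os.makedirs(path)          # data-creation job runs first...
    validate_columns(columns)  # ...then the metastore step fails, orphaning path

def ctas_register_then_write(path, columns):
    validate_columns(columns)  # fail fast, before any data is written
    os.makedirs(path)

base = tempfile.mkdtemp()
bad = ["id", "DATE '2018-01-01' + make_dt_interval(0, id, 0, 0.00)"]

try:
    ctas_write_then_register(os.path.join(base, "ds1"), bad)
except MetastoreError:
    pass
print(os.path.exists(os.path.join(base, "ds1")))   # True: orphaned directory

try:
    ctas_register_then_write(os.path.join(base, "hive1"), bad)
except MetastoreError:
    pass
print(os.path.exists(os.path.join(base, "hive1")))  # False: nothing left behind
shutil.rmtree(base)
```

The second function mirrors the Hive path's behavior in the report: because nothing is written before validation, retrying with a valid schema succeeds.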
[jira] [Updated] (SPARK-43149) When CTAS with USING fails to store metadata in metastore, data gets left around
[ https://issues.apache.org/jira/browse/SPARK-43149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-43149: -- Summary: When CTAS with USING fails to store metadata in metastore, data gets left around (was: When CREATE USING fails to store metadata in metastore, data gets left around) > When CTAS with USING fails to store metadata in metastore, data gets left > around > > > Key: SPARK-43149 > URL: https://issues.apache.org/jira/browse/SPARK-43149 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > For example: > {noformat} > drop table if exists parquet_ds1; > -- try creating table with invalid column name > -- use 'using parquet' to designate the data source > create table parquet_ds1 using parquet as > select id, date'2018-01-01' + make_dt_interval(0, id) > from range(0, 10); > Cannot create a table having a column whose name contains commas in Hive > metastore. Table: `spark_catalog`.`default`.`parquet_ds1`; Column: DATE > '2018-01-01' + make_dt_interval(0, id, 0, 0.00) > -- show that table did not get created > show tables; > -- try again with valid column name > -- spark will complain that directory already exists > create table parquet_ds1 using parquet as > select id, date'2018-01-01' + make_dt_interval(0, id) as ts > from range(0, 10); > [LOCATION_ALREADY_EXISTS] Cannot name the managed table as > `spark_catalog`.`default`.`parquet_ds1`, as its associated location > 'file:/Users/bruce/github/spark_upstream/spark-warehouse/parquet_ds1' already > exists. Please pick a different table name, or remove the existing location > first. > org.apache.spark.SparkRuntimeException: [LOCATION_ALREADY_EXISTS] Cannot name > the managed table as `spark_catalog`.`default`.`parquet_ds1`, as its > associated location > 'file:/Users/bruce/github/spark_upstream/spark-warehouse/parquet_ds1' already > exists. 
Please pick a different table name, or remove the existing location > first. > at > org.apache.spark.sql.errors.QueryExecutionErrors$.locationAlreadyExists(QueryExecutionErrors.scala:2804) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateTableLocation(SessionCatalog.scala:414) > at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176) > ... > {noformat} > One must manually remove the directory {{spark-warehouse/parquet_ds1}} before > the {{create table}} command will succeed. > It seems that datasource table creation runs the data-creation job first, > then stores the metadata into the metastore. > When using Spark to create Hive tables, the issue does not happen: > {noformat} > drop table if exists parquet_hive1; > -- try creating table with invalid column name, > -- but use 'stored as parquet' instead of 'using' > create table parquet_hive1 stored as parquet as > select id, date'2018-01-01' + make_dt_interval(0, id) > from range(0, 10); > Cannot create a table having a column whose name contains commas in Hive > metastore. Table: `spark_catalog`.`default`.`parquet_hive1`; Column: DATE > '2018-01-01' + make_dt_interval(0, id, 0, 0.00) > -- try again with valid column name. This will succeed; > create table parquet_hive1 stored as parquet as > select id, date'2018-01-01' + make_dt_interval(0, id) as ts > from range(0, 10); > {noformat} > It seems that Hive table creation stores metadata into the metastore first, > then runs the data-creation job. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-43113) Codegen error when full outer join's bound condition has multiple references to the same stream-side column
[ https://issues.apache.org/jira/browse/SPARK-43113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711614#comment-17711614 ] Bruce Robbins edited comment on SPARK-43113 at 4/14/23 6:02 AM: PR here: https://github.com/apache/spark/pull/40766 was (Author: bersprockets): PR here: https://github.com/apache/spark/pull/40766/files > Codegen error when full outer join's bound condition has multiple references > to the same stream-side column > --- > > Key: SPARK-43113 > URL: https://issues.apache.org/jira/browse/SPARK-43113 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > Example # 1 (sort merge join): > {noformat} > create or replace temp view v1 as > select * from values > (1, 1), > (2, 2), > (3, 1) > as v1(key, value); > create or replace temp view v2 as > select * from values > (1, 22, 22), > (3, -1, -1), > (7, null, null) > as v2(a, b, c); > select * > from v1 > full outer join v2 > on key = a > and value > b > and value > c; > {noformat} > The join's generated code causes the following compilation error: > {noformat} > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 277, Column 9: Redefinition of local variable "smj_isNull_7" > {noformat} > Example #2 (shuffle hash join): > {noformat} > select /*+ SHUFFLE_HASH(v2) */ * > from v1 > full outer join v2 > on key = a > and value > b > and value > c; > {noformat} > The shuffle hash join's generated code causes the following compilation error: > {noformat} > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 174, Column 5: Redefinition of local variable "shj_value_1" > {noformat} > With default configuration, both queries end up succeeding, since Spark falls > back to running each query with whole-stage codegen disabled. > The issue happens only when the join's bound condition refers to the same > stream-side column more than once. 
[jira] [Commented] (SPARK-43113) Codegen error when full outer join's bound condition has multiple references to the same stream-side column
[ https://issues.apache.org/jira/browse/SPARK-43113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711614#comment-17711614 ] Bruce Robbins commented on SPARK-43113: --- PR here: https://github.com/apache/spark/pull/40766/files > Codegen error when full outer join's bound condition has multiple references > to the same stream-side column > --- > > Key: SPARK-43113 > URL: https://issues.apache.org/jira/browse/SPARK-43113 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > > Example # 1 (sort merge join): > {noformat} > create or replace temp view v1 as > select * from values > (1, 1), > (2, 2), > (3, 1) > as v1(key, value); > create or replace temp view v2 as > select * from values > (1, 22, 22), > (3, -1, -1), > (7, null, null) > as v2(a, b, c); > select * > from v1 > full outer join v2 > on key = a > and value > b > and value > c; > {noformat} > The join's generated code causes the following compilation error: > {noformat} > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 277, Column 9: Redefinition of local variable "smj_isNull_7" > {noformat} > Example #2 (shuffle hash join): > {noformat} > select /*+ SHUFFLE_HASH(v2) */ * > from v1 > full outer join v2 > on key = a > and value > b > and value > c; > {noformat} > The shuffle hash join's generated code causes the following compilation error: > {noformat} > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 174, Column 5: Redefinition of local variable "shj_value_1" > {noformat} > With default configuration, both queries end up succeeding, since Spark falls > back to running each query with whole-stage codegen disabled. > The issue happens only when the join's bound condition refers to the same > stream-side column more than once. 
[jira] [Created] (SPARK-43113) Codegen error when full outer join's bound condition has multiple references to the same stream-side column
Bruce Robbins created SPARK-43113: - Summary: Codegen error when full outer join's bound condition has multiple references to the same stream-side column Key: SPARK-43113 URL: https://issues.apache.org/jira/browse/SPARK-43113 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.2, 3.4.0, 3.5.0 Reporter: Bruce Robbins Example # 1 (sort merge join): {noformat} create or replace temp view v1 as select * from values (1, 1), (2, 2), (3, 1) as v1(key, value); create or replace temp view v2 as select * from values (1, 22, 22), (3, -1, -1), (7, null, null) as v2(a, b, c); select * from v1 full outer join v2 on key = a and value > b and value > c; {noformat} The join's generated code causes the following compilation error: {noformat} org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 277, Column 9: Redefinition of local variable "smj_isNull_7" {noformat} Example #2 (shuffle hash join): {noformat} select /*+ SHUFFLE_HASH(v2) */ * from v1 full outer join v2 on key = a and value > b and value > c; {noformat} The shuffle hash join's generated code causes the following compilation error: {noformat} org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 174, Column 5: Redefinition of local variable "shj_value_1" {noformat} With default configuration, both queries end up succeeding, since Spark falls back to running each query with whole-stage codegen disabled. The issue happens only when the join's bound condition refers to the same stream-side column more than once. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
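The failure mode (emitting code for the same stream-side column twice, each emission declaring the same local variable) and the usual remedy can be sketched outside of Spark. The per-expression cache below is a common codegen pattern and is illustrative only; it is not Spark's `CodegenContext` API:

```python
# Sketch: generating code for the same column twice must reuse the first
# declaration instead of redeclaring the local variable (the cause of the
# "Redefinition of local variable" compile error above).
class CodegenContext:
    def __init__(self):
        self._cache = {}   # column -> already-declared variable name
        self.lines = []    # emitted Java-like source lines

    def gen_column(self, column):
        if column in self._cache:      # second reference: reuse, don't redeclare
            return self._cache[column]
        var = f"val_{len(self._cache)}"
        self.lines.append(f"int {var} = row.getInt({column!r});")
        self._cache[column] = var
        return var

ctx = CodegenContext()
# the bound condition references "value" twice: value > b AND value > c
v1 = ctx.gen_column("value")
v2 = ctx.gen_column("value")
print(v1 == v2, len(ctx.lines))  # True 1 -- one declaration, reused twice
```

Without the cache lookup, the second reference would emit a second `int val_...` declaration with the same name, which is exactly the redefinition the compiler rejects.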
[jira] [Commented] (SPARK-42937) Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled
[ https://issues.apache.org/jira/browse/SPARK-42937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705702#comment-17705702 ] Bruce Robbins commented on SPARK-42937:
---

PR at https://github.com/apache/spark/pull/40569

> Join with subquery in condition can fail with wholestage codegen and adaptive
> execution disabled
> -
>
> Key: SPARK-42937
> URL: https://issues.apache.org/jira/browse/SPARK-42937
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.2, 3.4.0, 3.5.0
> Reporter: Bruce Robbins
> Priority: Major
[jira] [Updated] (SPARK-42937) Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled
[ https://issues.apache.org/jira/browse/SPARK-42937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-42937:
--
Affects Version/s: 3.4.0

> Join with subquery in condition can fail with wholestage codegen and adaptive
> execution disabled
> -
>
> Key: SPARK-42937
> URL: https://issues.apache.org/jira/browse/SPARK-42937
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.2, 3.4.0, 3.5.0
> Reporter: Bruce Robbins
> Priority: Major
[jira] [Updated] (SPARK-42937) Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled
[ https://issues.apache.org/jira/browse/SPARK-42937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-42937:
--
Affects Version/s: 3.3.2

> Join with subquery in condition can fail with wholestage codegen and adaptive
> execution disabled
> -
>
> Key: SPARK-42937
> URL: https://issues.apache.org/jira/browse/SPARK-42937
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.2, 3.5.0
> Reporter: Bruce Robbins
> Priority: Major
[jira] [Created] (SPARK-42937) Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled
Bruce Robbins created SPARK-42937: - Summary: Join with subquery in condition can fail with wholestage codegen and adaptive execution disabled Key: SPARK-42937 URL: https://issues.apache.org/jira/browse/SPARK-42937 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Bruce Robbins The below left outer join gets an error: {noformat} create or replace temp view v1 as select * from values (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), (3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) as v1(key, value1, value2, value3, value4, value5, value6, value7, value8, value9, value10); create or replace temp view v2 as select * from values (1, 2), (3, 8), (7, 9) as v2(a, b); create or replace temp view v3 as select * from values (3), (8) as v3(col1); set spark.sql.codegen.maxFields=10; -- let's make maxFields 10 instead of 100 set spark.sql.adaptive.enabled=false; select * from v1 left outer join v2 on key = a and key in (select col1 from v3); {noformat} The join fails during predicate codegen: {noformat} 23/03/27 12:24:12 WARN Predicate: Expr codegen error and falling back to interpreter mode java.lang.IllegalArgumentException: requirement failed: input[0, int, false] IN subquery#34 has not finished at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144) at org.apache.spark.sql.execution.InSubqueryExec.doGenCode(subquery.scala:156) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:201) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196) at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.$anonfun$generateExpressions$2(CodeGenerator.scala:1278) at scala.collection.immutable.List.map(List.scala:293) at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressions(CodeGenerator.scala:1278) at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:41) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:33) at org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:73) at org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:70) at org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51) at org.apache.spark.sql.catalyst.expressions.Predicate$.create(predicates.scala:86) at org.apache.spark.sql.execution.joins.HashJoin.boundCondition(HashJoin.scala:146) at org.apache.spark.sql.execution.joins.HashJoin.boundCondition$(HashJoin.scala:140) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition$lzycompute(BroadcastHashJoinExec.scala:40) at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition(BroadcastHashJoinExec.scala:40) {noformat} It fails again after fallback to interpreter mode: {noformat} 23/03/27 12:24:12 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7) java.lang.IllegalArgumentException: requirement failed: input[0, int, false] IN subquery#34 has not finished at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144) at org.apache.spark.sql.execution.InSubqueryExec.eval(subquery.scala:151) at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52) at org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2(HashJoin.scala:146) at org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2$adapted(HashJoin.scala:146) at org.apache.spark.sql.execution.joins.HashJoin.$anonfun$outerJoin$1(HashJoin.scala:205) {noformat} Both the predicate codegen and the evaluation fail for the same reason: {{PlanSubqueries}} creates 
{{InSubqueryExec}} with {{shouldBroadcast=false}}. The driver waits for the subquery to finish, but it's the executor that uses the results of the subquery (for predicate codegen or evaluation). Because {{shouldBroadcast}} is set to false, the result is stored in a transient field ({{InSubqueryExec#result}}), so the result of the subquery is not serialized when the {{InSubqueryExec}} instance is sent to the executor. When wholestage codegen is enabled, the predicate codegen happens on the driver, so the subquery's result is available. When adaptive execution is enabled, {{PlanAdaptiveSubqueries}} always sets {{shouldBroadcast=true}}, so the
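The transient-field behavior described above can be demonstrated outside Spark. The sketch below uses a hypothetical `SubqueryHolder` class (a stand-in for {{InSubqueryExec}}, not the real class) to show that a value cached in a transient field on the driver does not survive Java serialization, which is why the executor sees a subquery that "has not finished":

```java
import java.io.*;

// Hypothetical stand-in for an operator that caches a subquery result in a
// transient field (analogous to InSubqueryExec#result). Transient fields are
// skipped by Java serialization, so the cached value is lost when the
// instance is shipped to an executor.
class SubqueryHolder implements Serializable {
    transient int[] result;  // populated on the driver after the subquery finishes

    static SubqueryHolder roundTrip(SubqueryHolder h) {
        try {
            // Serialize and deserialize, simulating the send to an executor.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new ObjectOutputStream(bytes).writeObject(h);
            ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            return (SubqueryHolder) in.readObject();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

After the round trip, `result` is back to `null` on the "executor" side. With {{shouldBroadcast=true}}, as {{PlanAdaptiveSubqueries}} always sets it, the result travels by a different mechanism, which is consistent with the failure appearing only when adaptive execution is disabled.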
[jira] [Commented] (SPARK-42909) INSERT INTO with column list does not work
[ https://issues.apache.org/jira/browse/SPARK-42909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17704368#comment-17704368 ] Bruce Robbins commented on SPARK-42909:
---

It looks like this capability landed in 3.4/3.5 with SPARK-42521.

> INSERT INTO with column list does not work
> --
>
> Key: SPARK-42909
> URL: https://issues.apache.org/jira/browse/SPARK-42909
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.2
> Environment: Databricks DBR12.2 on Azure, running Spark 3.3.2
> Documentation: [INSERT - Azure Databricks - Databricks SQL | Microsoft Learn|https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/sql-ref-syntax-dml-insert-into]
> Reporter: Tjomme Vergauwen
> Priority: Major
> Labels: databricks, documentation, spark-sql, sql
>
> Hi,
> When performing an INSERT INTO with a defined but incomplete column list, the
> missing columns should get a NULL value. However, an error is thrown
> indicating that a column is missing.
> *Case simulation:*
> drop table if exists default.TVTest;
> create table default.TVTest
> ( col1 int NOT NULL
> , col2 int
> );
> insert into default.TVTest select 1,2;
> insert into default.TVTest select 2,NULL; --> col2 can contain NULL values
> insert into default.TVTest (col1) select 3; -- Error in SQL statement: DeltaAnalysisException: Column col2 is not specified in INSERT
> insert into default.TVTest (col1) VALUES (3); -- Error in SQL statement: DeltaAnalysisException: Column col2 is not specified in INSERT
> select * from default.TVTest;

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42401) Incorrect results or NPE when inserting null value into array using array_insert/array_append
[ https://issues.apache.org/jira/browse/SPARK-42401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688759#comment-17688759 ] Bruce Robbins commented on SPARK-42401:
---

There is another case:
{noformat}
spark-sql> select array_insert(array('1', '2', '3', '4'), -6, '5');
23/02/14 16:10:19 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
{noformat}
{{array_insert}} might implicitly add nulls, and my fix does not cover that case. I will follow up.

> Incorrect results or NPE when inserting null value into array using
> array_insert/array_append
> -
>
> Key: SPARK-42401
> URL: https://issues.apache.org/jira/browse/SPARK-42401
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0, 3.5.0
> Reporter: Bruce Robbins
> Assignee: Bruce Robbins
> Priority: Major
> Labels: correctness
> Fix For: 3.4.0
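The implicit nulls mentioned in the comment above come from padding: an out-of-range negative position needs a result array longer than the inserted value plus the original elements, and the gap slots have no source value. A hedged illustration in plain Java (the helper `insertWithPadding` and its exact index convention are assumptions for this sketch, not Spark's implementation):

```java
import java.util.*;

// Hypothetical sketch of how an out-of-range negative index forces padding.
// Inserting at position -6 into a 4-element array needs a 6-slot result; the
// slot between the inserted value and the original elements has no source
// value, so it becomes an implicit null that downstream writers must handle.
class PadSketch {
    static List<String> insertWithPadding(List<String> arr, int posFromEnd, String v) {
        // posFromEnd is the magnitude of the negative index (e.g. 6 for -6).
        int pad = posFromEnd - arr.size() - 1;       // slots with no source value
        List<String> out = new ArrayList<>();
        out.add(v);
        for (int i = 0; i < pad; i++) out.add(null); // implicit nulls appear here
        out.addAll(arr);
        return out;
    }
}
```

The point is not the exact placement, which depends on Spark's index semantics, but that nulls can enter the result even when neither input contains one, so a null-unaware writer can still hit the NPE shown in the comment.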
[jira] [Updated] (SPARK-42401) Incorrect results or NPE when inserting null value into array using array_insert/array_append
[ https://issues.apache.org/jira/browse/SPARK-42401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-42401:
--
Summary: Incorrect results or NPE when inserting null value into array using array_insert/array_append (was: Incorrect results or NPE when inserting null value using array_insert/array_append)

> Incorrect results or NPE when inserting null value into array using
> array_insert/array_append
> -
>
> Key: SPARK-42401
> URL: https://issues.apache.org/jira/browse/SPARK-42401
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0, 3.5.0
> Reporter: Bruce Robbins
> Priority: Major
> Labels: correctness
[jira] [Updated] (SPARK-42401) Incorrect results or NPE when inserting null value using array_insert/array_append
[ https://issues.apache.org/jira/browse/SPARK-42401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-42401:
--
Labels: correctness (was: )

> Incorrect results or NPE when inserting null value using
> array_insert/array_append
> --
>
> Key: SPARK-42401
> URL: https://issues.apache.org/jira/browse/SPARK-42401
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0, 3.5.0
> Reporter: Bruce Robbins
> Priority: Major
> Labels: correctness
[jira] [Created] (SPARK-42401) Incorrect results or NPE when inserting null value using array_insert/array_append
Bruce Robbins created SPARK-42401:
-
Summary: Incorrect results or NPE when inserting null value using array_insert/array_append
Key: SPARK-42401
URL: https://issues.apache.org/jira/browse/SPARK-42401
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.4.0, 3.5.0
Reporter: Bruce Robbins

Example:
{noformat}
create or replace temp view v1 as
select * from values
(array(1, 2, 3, 4), 5, 5),
(array(1, 2, 3, 4), 5, null)
as v1(col1,col2,col3);

select array_insert(col1, col2, col3) from v1;
{noformat}
This produces an incorrect result:
{noformat}
[1,2,3,4,5]
[1,2,3,4,0] <== should be [1,2,3,4,null]
{noformat}
A more succinct example:
{noformat}
select array_insert(array(1, 2, 3, 4), 5, cast(null as int));
{noformat}
This also produces an incorrect result:
{noformat}
[1,2,3,4,0] <== should be [1,2,3,4,null]
{noformat}
Another example:
{noformat}
create or replace temp view v1 as
select * from values
(array('1', '2', '3', '4'), 5, '5'),
(array('1', '2', '3', '4'), 5, null)
as v1(col1,col2,col3);

select array_insert(col1, col2, col3) from v1;
{noformat}
The above query throws a {{NullPointerException}}:
{noformat}
23/02/10 11:08:05 ERROR SparkSQLDriver: Failed in [select array_insert(col1, col2, col3) from v1]
java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.execution.LocalTableScanExec.$anonfun$unsafeRows$1(LocalTableScanExec.scala:44)
{noformat}
{{array_append}} has the same issue:
{noformat}
spark-sql> select array_append(array(1, 2, 3, 4), cast(null as int));
[1,2,3,4,0] <== should be [1,2,3,4,null]
Time taken: 3.679 seconds, Fetched 1 row(s)
spark-sql> select array_append(array('1', '2', '3', '4'), cast(null as string));
23/02/10 11:13:36 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
{noformat}
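Both symptoms above fit one pattern: a primitive slot cannot represent null, so writing a null element without also recording a null bit leaves the type's default value (0 for int) visible, and for strings the writer receives a null pointer it does not expect. A simplified sketch (the class and method names here are invented for illustration, not Spark's {{UnsafeWriter}} API):

```java
// Minimal model of a row slot backed by a primitive, plus a null bit.
// The "null-unaware" write is the buggy path (null collapses to 0);
// the "null-aware" write is the required pattern (record the null bit first).
class IntSlotRow {
    int value;        // primitive storage: cannot hold null, defaults to 0
    boolean isNull;

    void writeNullUnaware(Integer v) {
        value = (v == null) ? 0 : v;       // buggy: the null is silently lost
    }

    void writeNullAware(Integer v) {
        if (v == null) { isNull = true; }  // fixed: null bit set, slot ignored
        else { value = v; }
    }

    Integer read() { return isNull ? null : value; }
}
```

Reading back after the null-unaware write yields 0, matching the incorrect `[1,2,3,4,0]` results; the null-aware write yields null, matching the expected `[1,2,3,4,null]`.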
[jira] [Updated] (SPARK-42384) Mask function's generated code does not handle null input
[ https://issues.apache.org/jira/browse/SPARK-42384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-42384:
--
Affects Version/s: 3.4.0

> Mask function's generated code does not handle null input
> -
>
> Key: SPARK-42384
> URL: https://issues.apache.org/jira/browse/SPARK-42384
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0, 3.5.0
> Reporter: Bruce Robbins
> Priority: Major
[jira] [Updated] (SPARK-41991) Interpreted mode subexpression elimination can throw exception during insert
[ https://issues.apache.org/jira/browse/SPARK-41991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-41991: -- Affects Version/s: 3.3.1 > Interpreted mode subexpression elimination can throw exception during insert > > > Key: SPARK-41991 > URL: https://issues.apache.org/jira/browse/SPARK-41991 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.1, 3.4.0 >Reporter: Bruce Robbins >Priority: Major > > Example: > {noformat} > drop table if exists tbl1; > create table tbl1 (a int, b int) using parquet; > set spark.sql.codegen.wholeStage=false; > set spark.sql.codegen.factoryMode=NO_CODEGEN; > insert into tbl1 > select id as a, id as b > from range(1, 5); > {noformat} > This results in the following exception: > {noformat} > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.ExpressionProxy cannot be cast to > org.apache.spark.sql.catalyst.expressions.Cast > at > org.apache.spark.sql.catalyst.expressions.CheckOverflowInTableInsert.withNewChildInternal(Cast.scala:2514) > at > org.apache.spark.sql.catalyst.expressions.CheckOverflowInTableInsert.withNewChildInternal(Cast.scala:2512) > {noformat} > The query produces 2 bigint values, but the table's schema expects 2 int > values, so Spark wraps each output field with a {{Cast}}. > Later, in {{InterpretedUnsafeProjection}}, {{prepareExpressions}} tries to > wrap the two {{Cast}} expressions with an {{ExpressionProxy}}. However, the > parent expression of each {{Cast}} is a {{CheckOverflowInTableInsert}} > expression, which does not accept {{ExpressionProxy}} as a child. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
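A hypothetical Python sketch of that failure mode (the class names mirror Spark's Catalyst expressions, but this is not Spark code): subexpression elimination substitutes a proxy for a shared child, while the parent's child-replacement method unconditionally casts the new child back to the concrete type it expects.

```python
# Stand-in classes illustrating the ClassCastException above.
class Cast:
    def __init__(self, child):
        self.child = child

class ExpressionProxy:
    # Wrapper added by interpreted-mode subexpression elimination.
    def __init__(self, child):
        self.child = child

class CheckOverflowInTableInsert:
    def __init__(self, child):
        self.child = child

    def with_new_child(self, new_child):
        # Mirrors the Scala `copy(child = newChild.asInstanceOf[Cast])`:
        # the cast fails when the new child is an ExpressionProxy.
        if not isinstance(new_child, Cast):
            raise TypeError(f"{type(new_child).__name__} cannot be cast to Cast")
        return CheckOverflowInTableInsert(new_child)

parent = CheckOverflowInTableInsert(Cast(child="id"))
try:
    parent.with_new_child(ExpressionProxy(parent.child))
except TypeError as e:
    print(e)  # prints: ExpressionProxy cannot be cast to Cast
```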
[jira] [Updated] (SPARK-41804) InterpretedUnsafeProjection doesn't properly handle an array of UDTs
[ https://issues.apache.org/jira/browse/SPARK-41804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-41804: -- Description: Reproduction steps: {noformat} // create a file of vector data import org.apache.spark.ml.linalg.{DenseVector, Vector} case class TestRow(varr: Array[Vector]) val values = Array(0.1d, 0.2d, 0.3d) val dv = new DenseVector(values).asInstanceOf[Vector] val ds = Seq(TestRow(Array(dv, dv))).toDS ds.coalesce(1).write.mode("overwrite").format("parquet").save("vector_data") // this works spark.read.format("parquet").load("vector_data").collect sql("set spark.sql.codegen.wholeStage=false") sql("set spark.sql.codegen.factoryMode=NO_CODEGEN") // this will get an error spark.read.format("parquet").load("vector_data").collect {noformat} The error varies each time you run it, e.g.: {noformat} Sparse vectors require that the dimension of the indices match the dimension of the values. You provided 2 indices and 6619240 values. {noformat} or {noformat} org.apache.spark.SparkRuntimeException: Error while decoding: java.lang.NegativeArraySizeException {noformat} or {noformat} java.lang.OutOfMemoryError: Java heap space at org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.toDoubleArray(UnsafeArrayData.java:414) {noformat} or {noformat} # # A fatal error has been detected by the Java Runtime Environment: # # SIGBUS (0xa) at pc=0x0001120c9d30, pid=64213, tid=0x1003 # # JRE version: Java(TM) SE Runtime Environment (8.0_311-b11) (build 1.8.0_311-b11) # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.311-b11 mixed mode bsd-amd64 compressed oops) # Problematic frame: # V [libjvm.dylib+0xc9d30] acl_CopyRight+0x29 # # Failed to write core dump. Core dumps have been disabled. 
To enable core dumping, try "ulimit -c unlimited" before starting Java again # # An error report file with more information is saved as: # //hs_err_pid64213.log Compiled method (nm) 582142 11318 n 0 sun.misc.Unsafe::copyMemory (native) total in heap [0x00011efa8890,0x00011efa8be8] = 856 relocation [0x00011efa89b8,0x00011efa89f8] = 64 main code [0x00011efa8a00,0x00011efa8be8] = 488 Compiled method (nm) 582142 11318 n 0 sun.misc.Unsafe::copyMemory (native) total in heap [0x00011efa8890,0x00011efa8be8] = 856 relocation [0x00011efa89b8,0x00011efa89f8] = 64 main code [0x00011efa8a00,0x00011efa8be8] = 488 # # If you would like to submit a bug report, please visit: # http://bugreport.java.com/bugreport/crash.jsp # {noformat}
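The crashes above vary from run to run because the reader is decoding whatever bytes happen to sit at the miscomputed offsets. A toy Python sketch of that class of bug (illustrative only — this is not Spark's UnsafeArrayData layout): writing array elements at the wrong per-element stride makes reads at the expected stride decode overlapping or uninitialized bytes.

```python
# Toy sketch: a wrong element stride corrupts everything after the first
# element, which is the flavor of failure above -- garbage sizes, negative
# array lengths, OOM, or a SIGBUS, depending on what the bytes decode to.
import struct

values = [0.1, 0.2, 0.3]

# Correct: 8-byte doubles written at 8-byte offsets round-trip cleanly.
buf = bytearray(8 * len(values))
for i, v in enumerate(values):
    struct.pack_into("<d", buf, 8 * i, v)
assert [struct.unpack_from("<d", buf, 8 * i)[0] for i in range(3)] == values

# Buggy: writing at a 4-byte stride overlaps the elements, so reads at
# the expected 8-byte stride return garbage (and the tail stays zeroed).
bad = bytearray(8 * len(values))
for i, v in enumerate(values):
    struct.pack_into("<d", bad, 4 * i, v)
garbage = [struct.unpack_from("<d", bad, 8 * i)[0] for i in range(3)]
print(garbage)  # not [0.1, 0.2, 0.3]
```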