[jira] [Resolved] (SPARK-46991) Replace IllegalArgumentException by SparkIllegalArgumentException in catalyst
[ https://issues.apache.org/jira/browse/SPARK-46991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-46991.
-----------------------------------
    Resolution: Fixed

Issue resolved by pull request 45033
[https://github.com/apache/spark/pull/45033]

> Replace IllegalArgumentException by SparkIllegalArgumentException in catalyst
> ------------------------------------------------------------------------------
>
> Key: SPARK-46991
> URL: https://issues.apache.org/jira/browse/SPARK-46991
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Max Gekk
> Assignee: Max Gekk
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Replace all IllegalArgumentException by SparkIllegalArgumentException in the
> Catalyst code base, and introduce new legacy error classes with the
> _LEGACY_ERROR_TEMP_ prefix.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
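As a rough illustration of the migration this issue describes, the sketch below shows the general before/after pattern used inside the Spark code base. The method name, message, and the legacy error class number are hypothetical placeholders, not values taken from the actual pull request.

{code:java}
import org.apache.spark.SparkIllegalArgumentException

// Before: a plain JVM exception with an ad-hoc, untracked message.
def checkPrecision(precision: Int): Unit = {
  if (precision <= 0) {
    throw new IllegalArgumentException(s"Invalid precision: $precision")
  }
}

// After: a SparkIllegalArgumentException bound to a legacy error class, so the
// message text lives in the centralized error-class definitions instead of the
// call site. "_LEGACY_ERROR_TEMP_XXXX" is a placeholder for a real registered class.
def checkPrecisionMigrated(precision: Int): Unit = {
  if (precision <= 0) {
    throw new SparkIllegalArgumentException(
      errorClass = "_LEGACY_ERROR_TEMP_XXXX",
      messageParameters = Map("precision" -> precision.toString))
  }
}
{code}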
[jira] [Resolved] (SPARK-47022) Fix `connect/client/jvm` to have explicit `commons-lang3` test dependency
[ https://issues.apache.org/jira/browse/SPARK-47022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47022.
-----------------------------------
    Fix Version/s: 3.5.1
       Resolution: Fixed

Issue resolved by pull request 45081
[https://github.com/apache/spark/pull/45081]

> Fix `connect/client/jvm` to have explicit `commons-lang3` test dependency
> --------------------------------------------------------------------------
>
> Key: SPARK-47022
> URL: https://issues.apache.org/jira/browse/SPARK-47022
> Project: Spark
> Issue Type: Bug
> Components: Connect, Tests
> Affects Versions: 3.5.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.1
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47022) Fix `connect/client/jvm` to have explicit `commons-lang3` test dependency
[ https://issues.apache.org/jira/browse/SPARK-47022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-47022:
-------------------------------------
    Assignee: Dongjoon Hyun

> Fix `connect/client/jvm` to have explicit `commons-lang3` test dependency
> --------------------------------------------------------------------------
>
> Key: SPARK-47022
> URL: https://issues.apache.org/jira/browse/SPARK-47022
> Project: Spark
> Issue Type: Bug
> Components: Connect, Tests
> Affects Versions: 3.5.0
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47022) Fix `connect/client/jvm` to have explicit `commons-lang3` test dependency
[ https://issues.apache.org/jira/browse/SPARK-47022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47022:
-----------------------------------
    Labels: pull-request-available  (was: )

> Fix `connect/client/jvm` to have explicit `commons-lang3` test dependency
> --------------------------------------------------------------------------
>
> Key: SPARK-47022
> URL: https://issues.apache.org/jira/browse/SPARK-47022
> Project: Spark
> Issue Type: Bug
> Components: Connect, Tests
> Affects Versions: 3.5.0
> Reporter: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47022) Fix `connect/client/jvm` to have explicit `commons-lang3` test dependency
Dongjoon Hyun created SPARK-47022:
-------------------------------------

    Summary: Fix `connect/client/jvm` to have explicit `commons-lang3` test dependency
    Key: SPARK-47022
    URL: https://issues.apache.org/jira/browse/SPARK-47022
    Project: Spark
    Issue Type: Bug
    Components: Connect, Tests
    Affects Versions: 3.5.0
    Reporter: Dongjoon Hyun

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47021) Fix `kvstore` module to have explicit `commons-lang3` test dependency
[ https://issues.apache.org/jira/browse/SPARK-47021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47021.
-----------------------------------
    Fix Version/s: 3.4.3
                   3.5.1
                   4.0.0
       Resolution: Fixed

Issue resolved by pull request 45080
[https://github.com/apache/spark/pull/45080]

> Fix `kvstore` module to have explicit `commons-lang3` test dependency
> ----------------------------------------------------------------------
>
> Key: SPARK-47021
> URL: https://issues.apache.org/jira/browse/SPARK-47021
> Project: Spark
> Issue Type: Bug
> Components: Build, Tests
> Affects Versions: 3.3.0, 3.4.2, 3.5.0, 3.3.4
> Reporter: Dongjoon Hyun
> Assignee: Dongjoon Hyun
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.3, 3.5.1, 4.0.0
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
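Both this issue and SPARK-47022 above are the same kind of fix: a module whose tests use `commons-lang3` now declares that dependency explicitly with test scope instead of relying on it transitively. Spark's actual build declares this in the module's Maven pom.xml; purely as an illustrative sketch, an equivalent sbt-style declaration might look like the following (the version value is a placeholder and would follow the parent build's dependency management):

{code:java}
// Hypothetical sbt-style equivalent; Spark itself declares this in the module's
// pom.xml with <scope>test</scope>. The version below is a placeholder.
val commonsLang3Version = "3.14.0" // placeholder; keep in sync with the parent build
libraryDependencies += "org.apache.commons" % "commons-lang3" % commonsLang3Version % Test
{code}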
[jira] [Resolved] (SPARK-47019) AQE dynamic cache partitioning causes SortMergeJoin to result in data loss
[ https://issues.apache.org/jira/browse/SPARK-47019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ridvan Appa Bugis resolved SPARK-47019.
---------------------------------------
    Fix Version/s: 3.5.1
       Resolution: Fixed

> AQE dynamic cache partitioning causes SortMergeJoin to result in data loss
> ---------------------------------------------------------------------------
>
> Key: SPARK-47019
> URL: https://issues.apache.org/jira/browse/SPARK-47019
> Project: Spark
> Issue Type: Bug
> Components: Optimizer, Spark Core
> Affects Versions: 3.5.0
> Environment: Tested in 3.5.0
> Reproduced on, so far:
> * kubernetes deployment
> * docker cluster deployment
> Local Cluster:
> * master
> * worker1 (2/2G)
> * worker2 (1/1G)
> Reporter: Ridvan Appa Bugis
> Priority: Blocker
> Labels: DAG, caching, correctness, data-loss, dynamic_allocation, inconsistency, partitioning
> Fix For: 3.5.1
>
> Attachments: Screenshot 2024-02-07 at 20.09.44.png, Screenshot 2024-02-07 at 20.10.07.png, eventLogs-app-20240207175940-0023.zip, testdata.zip
>
> It seems we have encountered an issue with Spark AQE's dynamic cache partitioning which causes incorrect *count* output values and data loss.
> A similar issue could not be found, so I am creating this ticket to raise awareness.
>
> Preconditions:
> - Set up a cluster as per the environment specification
> - Prepare test data (or data large enough to trigger a read by both executors)
> Steps to reproduce:
> - Read parent
> - Self-join parent
> - Cache + materialize parent
> - Join parent with child
>
> Performing a self-join over a parentDF, then caching + materialising the DF, and then joining it with a childDF results in an *incorrect* count value and {*}missing data{*}.
>
> Performing a *repartition* seems to fix the issue, most probably due to the rearrangement of the underlying partitions and a statistics update.
>
> This behaviour is observed on a multi-worker cluster with a job running 2 executors (1 per worker), when the data file is large enough to be read by both executors.
> Not reproducible in local mode.
>
> Circumvention:
> So far, this can be alleviated by disabling _spark.sql.optimizer.canChangeCachedPlanOutputPartitioning_ or by performing a repartition, but that is not a fix of the root cause.
>
> This issue is dangerous considering that the data loss occurs silently and, in the absence of proper checks, can lead to wrong behaviour/results down the line. So we have labeled it as a blocker.
>
> There seems to be a file-size threshold after which data loss is observed (possibly implying that it happens when both executors start reading the data file).
>
> Minimal example:
> {code:java}
> // Read parent
> val parentData = session.read.format("avro").load("/data/shared/test/parent")
> // Self join parent and cache + materialize
> val parent = parentData.join(parentData, Seq("PID")).cache()
> parent.count()
> // Read child
> val child = session.read.format("avro").load("/data/shared/test/child")
> // Basic join
> val resultBasic = child.join(
>   parent,
>   parent("PID") === child("PARENT_ID")
> )
> // Count: 16479 (Wrong)
> println(s"Count no repartition: ${resultBasic.count()}")
> // Repartition parent join
> val resultRepartition = child.join(
>   parent.repartition(),
>   parent("PID") === child("PARENT_ID")
> )
> // Count: 50094 (Correct)
> println(s"Count with repartition: ${resultRepartition.count()}") {code}
>
> Invalid count-only DAG:
> !Screenshot 2024-02-07 at 20.10.07.png|width=519,height=853!
> Valid repartition DAG:
> !Screenshot 2024-02-07 at 20.09.44.png|width=368,height=1219!
>
> Spark submit for this job:
> {code:java}
> spark-submit \
>   --class ExampleApp \
>   --packages org.apache.spark:spark-avro_2.12:3.5.0 \
>   --deploy-mode cluster \
>   --master spark://spark-master:6066 \
>   --conf spark.sql.autoBroadcastJoinThreshold=-1 \
>   --conf spark.cores.max=3 \
>   --driver-cores 1 \
>   --driver-memory 1g \
>   --executor-cores 1 \
>   --executor-memory 1g \
>   /path/to/test.jar
> {code}
> The cluster should be set up as follows (worker1(m+e), worker2(e)) so as to split the executors across two workers.
> I have prepared a simple GitHub repository which contains the compilable example above.
> [https://github.com/ridvanappabugis/spark-3.5-issue]
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
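For reference, the workaround described in the circumvention section can be applied when constructing the session. Below is a minimal sketch; the application name is illustrative, and the config key is the one named in the issue description.

{code:java}
import org.apache.spark.sql.SparkSession

// Workaround from the description: stop AQE from changing the output partitioning
// of cached plans. This avoids the wrong counts but does not fix the root cause.
val session = SparkSession.builder()
  .appName("ExampleApp")
  .config("spark.sql.optimizer.canChangeCachedPlanOutputPartitioning", "false")
  .getOrCreate()

// Alternative from the description: repartition the cached side before joining it, e.g.
// val parent = parentData.join(parentData, Seq("PID")).repartition().cache()
{code}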
[jira] [Commented] (SPARK-47019) AQE dynamic cache partitioning causes SortMergeJoin to result in data loss
[ https://issues.apache.org/jira/browse/SPARK-47019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816393#comment-17816393 ]

Ridvan Appa Bugis commented on SPARK-47019:
-------------------------------------------
[~bersprockets] thanks for the quick investigation. I will be resolving this issue in that case.
Looking forward to the 3.5.1 bump!

> AQE dynamic cache partitioning causes SortMergeJoin to result in data loss
> ---------------------------------------------------------------------------
>
> Key: SPARK-47019
> URL: https://issues.apache.org/jira/browse/SPARK-47019
> Project: Spark
> Issue Type: Bug
> Components: Optimizer, Spark Core
> Affects Versions: 3.5.0
> Environment: Tested in 3.5.0
> Reproduced on, so far:
> * kubernetes deployment
> * docker cluster deployment
> Local Cluster:
> * master
> * worker1 (2/2G)
> * worker2 (1/1G)
> Reporter: Ridvan Appa Bugis
> Priority: Blocker
> Labels: DAG, caching, correctness, data-loss, dynamic_allocation, inconsistency, partitioning
> Attachments: Screenshot 2024-02-07 at 20.09.44.png, Screenshot 2024-02-07 at 20.10.07.png, eventLogs-app-20240207175940-0023.zip, testdata.zip
>
> It seems we have encountered an issue with Spark AQE's dynamic cache partitioning which causes incorrect *count* output values and data loss.
> A similar issue could not be found, so I am creating this ticket to raise awareness.
>
> Preconditions:
> - Set up a cluster as per the environment specification
> - Prepare test data (or data large enough to trigger a read by both executors)
> Steps to reproduce:
> - Read parent
> - Self-join parent
> - Cache + materialize parent
> - Join parent with child
>
> Performing a self-join over a parentDF, then caching + materialising the DF, and then joining it with a childDF results in an *incorrect* count value and {*}missing data{*}.
>
> Performing a *repartition* seems to fix the issue, most probably due to the rearrangement of the underlying partitions and a statistics update.
>
> This behaviour is observed on a multi-worker cluster with a job running 2 executors (1 per worker), when the data file is large enough to be read by both executors.
> Not reproducible in local mode.
>
> Circumvention:
> So far, this can be alleviated by disabling _spark.sql.optimizer.canChangeCachedPlanOutputPartitioning_ or by performing a repartition, but that is not a fix of the root cause.
>
> This issue is dangerous considering that the data loss occurs silently and, in the absence of proper checks, can lead to wrong behaviour/results down the line. So we have labeled it as a blocker.
>
> There seems to be a file-size threshold after which data loss is observed (possibly implying that it happens when both executors start reading the data file).
>
> Minimal example:
> {code:java}
> // Read parent
> val parentData = session.read.format("avro").load("/data/shared/test/parent")
> // Self join parent and cache + materialize
> val parent = parentData.join(parentData, Seq("PID")).cache()
> parent.count()
> // Read child
> val child = session.read.format("avro").load("/data/shared/test/child")
> // Basic join
> val resultBasic = child.join(
>   parent,
>   parent("PID") === child("PARENT_ID")
> )
> // Count: 16479 (Wrong)
> println(s"Count no repartition: ${resultBasic.count()}")
> // Repartition parent join
> val resultRepartition = child.join(
>   parent.repartition(),
>   parent("PID") === child("PARENT_ID")
> )
> // Count: 50094 (Correct)
> println(s"Count with repartition: ${resultRepartition.count()}") {code}
>
> Invalid count-only DAG:
> !Screenshot 2024-02-07 at 20.10.07.png|width=519,height=853!
> Valid repartition DAG:
> !Screenshot 2024-02-07 at 20.09.44.png|width=368,height=1219!
>
> Spark submit for this job:
> {code:java}
> spark-submit \
>   --class ExampleApp \
>   --packages org.apache.spark:spark-avro_2.12:3.5.0 \
>   --deploy-mode cluster \
>   --master spark://spark-master:6066 \
>   --conf spark.sql.autoBroadcastJoinThreshold=-1 \
>   --conf spark.cores.max=3 \
>   --driver-cores 1 \
>   --driver-memory 1g \
>   --executor-cores 1 \
>   --executor-memory 1g \
>   /path/to/test.jar
> {code}
> The cluster should be set up as follows (worker1(m+e), worker2(e)) so as to split the executors across two workers.
> I have prepared a simple GitHub repository which contains the compilable example above.
> [https://github.com/ridvanappabugis/spark-3.5-issue]
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org