[jira] [Updated] (SPARK-46992) Inconsistent results with 'sort', 'cache', and AQE.
[ https://issues.apache.org/jira/browse/SPARK-46992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-46992:
---
Labels: correctness pull-request-available (was: correctness)

> Inconsistent results with 'sort', 'cache', and AQE.
> ---
>
> Key: SPARK-46992
> URL: https://issues.apache.org/jira/browse/SPARK-46992
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.2, 3.5.0
> Reporter: Denis Tarima
> Priority: Critical
> Labels: correctness, pull-request-available
>
> With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes
> {color:#4c9aff}sample{color} results after caching.
> Moreover, when cached, {color:#4c9aff}collect{color} returns records as if
> it's not cached, which is inconsistent with {color:#4c9aff}count{color} and
> {color:#4c9aff}show{color}.
> A script to reproduce:
> {code:scala}
> import spark.implicits._
> val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)
> println("NON CACHED:")
> println(" count: " + df.count())
> println(" collect: " + df.collect().mkString(" "))
> df.show()
> println("CACHED:")
> df.cache().count()
> println(" count: " + df.count())
> println(" collect: " + df.collect().mkString(" "))
> df.show()
> df.unpersist()
> {code}
> output:
> {code:java}
> NON CACHED:
>  count: 2
>  collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  4|
> +---+
> CACHED:
>  count: 3
>  collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
> {code}
> BTW, disabling AQE
> [{color:#4c9aff}spark.conf.set("spark.databricks.optimizer.adaptive.enabled",
> "false"){color}] helps on Databricks clusters, but locally it has no effect,
> at least on Spark 3.3.2.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
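A note on the AQE setting mentioned in the report: the key spark.databricks.optimizer.adaptive.enabled is Databricks-specific, while open-source Spark documents spark.sql.adaptive.enabled as the AQE switch. That may be why the quoted setting had no effect locally, although the report does not confirm it. Below is a minimal sketch, assuming a plain local session, of the same reproduction with the open-source key turned off. The object name, app name, and master setting are hypothetical, and whether the results become consistent is something to check by running it, not something this sketch asserts.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object; the body is the reporter's script with AQE
// disabled through the open-source configuration key.
object Spark46992Check {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")                  // assumption: local run, not Databricks
      .appName("SPARK-46992-check")        // hypothetical app name
      // Open-source Spark's AQE switch (the spark.databricks.* key is Databricks-only).
      .config("spark.sql.adaptive.enabled", "false")
      .getOrCreate()
    import spark.implicits._

    // Confirm the setting actually took effect in this session.
    println("AQE enabled: " + spark.conf.get("spark.sql.adaptive.enabled"))

    val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)

    println("NON CACHED:")
    println(" count: " + df.count())
    println(" collect: " + df.collect().mkString(" "))

    println("CACHED:")
    df.cache().count()                     // materialize the cache
    println(" count: " + df.count())
    println(" collect: " + df.collect().mkString(" "))

    df.unpersist()
    spark.stop()
  }
}
```

Comparing the two printed sections with the key set to "true" and then "false" shows whether AQE is the deciding factor on a given Spark version.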
[jira] [Updated] (SPARK-46992) Inconsistent results with 'sort', 'cache', and AQE.
[ https://issues.apache.org/jira/browse/SPARK-46992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Tarima updated SPARK-46992:
-
Description:
With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes {color:#4c9aff}sample{color} results after caching.
Moreover, when cached, {color:#4c9aff}collect{color} returns records as if it's not cached, which is inconsistent with {color:#4c9aff}count{color} and {color:#4c9aff}show{color}.

A script to reproduce:
{code:scala}
import spark.implicits._
val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)
println("NON CACHED:")
println(" count: " + df.count())
println(" collect: " + df.collect().mkString(" "))
df.show()
println("CACHED:")
df.cache().count()
println(" count: " + df.count())
println(" collect: " + df.collect().mkString(" "))
df.show()
df.unpersist()
{code}
output:
{code:java}
NON CACHED:
 count: 2
 collect: [1] [4]
+---+
| id|
+---+
|  1|
|  4|
+---+
CACHED:
 count: 3
 collect: [1] [4]
+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+
{code}
BTW, disabling AQE [{color:#4c9aff}spark.conf.set("spark.databricks.optimizer.adaptive.enabled", "false"){color}] helps on Databricks clusters, but locally it has no effect, at least on Spark 3.3.2.

was:
With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes {color:#4c9aff}sample{color} results after caching.
Moreover, when cached, {color:#4c9aff}collect{color} returns records as if it's not cached, which is inconsistent with {color:#4c9aff}count{color} and {color:#4c9aff}show{color}.

A script to reproduce:
{code:scala}
import spark.implicits._
val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)
println("NON CACHED:")
println(" count: " + df.count())
println(" collect: " + df.collect().mkString(" "))
df.show()
println("CACHED:")
df.cache().count()
println(" count: " + df.count())
println(" collect: " + df.collect().mkString(" "))
df.show()
df.unpersist()
{code}
output:
{code}
NON CACHED:
 count: 2
 collect: [1] [4]
+---+
| id|
+---+
|  1|
|  4|
+---+
CACHED:
 count: 3
 collect: [1] [4]
+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+
{code}

> Inconsistent results with 'sort', 'cache', and AQE.
> ---
>
> Key: SPARK-46992
> URL: https://issues.apache.org/jira/browse/SPARK-46992
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.2, 3.5.0
> Reporter: Denis Tarima
> Priority: Critical
> Labels: correctness
>
> With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes
> {color:#4c9aff}sample{color} results after caching.
> Moreover, when cached, {color:#4c9aff}collect{color} returns records as if
> it's not cached, which is inconsistent with {color:#4c9aff}count{color} and
> {color:#4c9aff}show{color}.
> A script to reproduce:
> {code:scala}
> import spark.implicits._
> val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)
> println("NON CACHED:")
> println(" count: " + df.count())
> println(" collect: " + df.collect().mkString(" "))
> df.show()
> println("CACHED:")
> df.cache().count()
> println(" count: " + df.count())
> println(" collect: " + df.collect().mkString(" "))
> df.show()
> df.unpersist()
> {code}
> output:
> {code:java}
> NON CACHED:
>  count: 2
>  collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  4|
> +---+
> CACHED:
>  count: 3
>  collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
> {code}
> BTW, disabling AQE
> [{color:#4c9aff}spark.conf.set("spark.databricks.optimizer.adaptive.enabled",
> "false"){color}] helps on Databricks clusters, but locally it has no effect,
> at least on Spark 3.3.2.
[jira] [Updated] (SPARK-46992) Inconsistent results with 'sort', 'cache', and AQE.
[ https://issues.apache.org/jira/browse/SPARK-46992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-46992:
-
Labels: correctness (was: )

> Inconsistent results with 'sort', 'cache', and AQE.
> ---
>
> Key: SPARK-46992
> URL: https://issues.apache.org/jira/browse/SPARK-46992
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.2, 3.5.0
> Reporter: Denis Tarima
> Priority: Critical
> Labels: correctness
>
> With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes
> {color:#4c9aff}sample{color} results after caching.
> Moreover, when cached, {color:#4c9aff}collect{color} returns records as if
> it's not cached, which is inconsistent with {color:#4c9aff}count{color} and
> {color:#4c9aff}show{color}.
> A script to reproduce:
> {code:scala}
> import spark.implicits._
> val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)
> println("NON CACHED:")
> println(" count: " + df.count())
> println(" collect: " + df.collect().mkString(" "))
> df.show()
> println("CACHED:")
> df.cache().count()
> println(" count: " + df.count())
> println(" collect: " + df.collect().mkString(" "))
> df.show()
> df.unpersist()
> {code}
> output:
> {code}
> NON CACHED:
>  count: 2
>  collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  4|
> +---+
> CACHED:
>  count: 3
>  collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
> {code}
[jira] [Updated] (SPARK-46992) Inconsistent results with 'sort', 'cache', and AQE.
[ https://issues.apache.org/jira/browse/SPARK-46992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Denis Tarima updated SPARK-46992:
-
Description:
With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes {color:#4c9aff}sample{color} results after caching.
Moreover, when cached, {color:#4c9aff}collect{color} returns records as if it's not cached, which is inconsistent with {color:#4c9aff}count{color} and {color:#4c9aff}show{color}.

A script to reproduce:
{code:scala}
import spark.implicits._
val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)
println("NON CACHED:")
println(" count: " + df.count())
println(" collect: " + df.collect().mkString(" "))
df.show()
println("CACHED:")
df.cache().count()
println(" count: " + df.count())
println(" collect: " + df.collect().mkString(" "))
df.show()
df.unpersist()
{code}
output:
{code}
NON CACHED:
 count: 2
 collect: [1] [4]
+---+
| id|
+---+
|  1|
|  4|
+---+
CACHED:
 count: 3
 collect: [1] [4]
+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+
{code}

was:
With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes {color:#4c9aff}sample{color} results after caching.
Moreover, when cached, {color:#4c9aff}collect{color} returns records as if it's not cached, which is inconsistent with {color:#4c9aff}count{color} and {color:#4c9aff}show{color}.

A script to reproduce:
import spark.implicits._
val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)
println("NON CACHED:")
println(" count: " + df.count())
println(" collect: " + df.collect().mkString(" "))
df.show()
println("CACHED:")
df.cache().count()
println(" count: " + df.count())
println(" collect: " + df.collect().mkString(" "))
df.show()
df.unpersist()

output:
NON CACHED:
 count: 2
 collect: [1] [4]
+---+
| id|
+---+
|  1|
|  4|
+---+
CACHED:
 count: 3
 collect: [1] [4]
+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+

> Inconsistent results with 'sort', 'cache', and AQE.
> ---
>
> Key: SPARK-46992
> URL: https://issues.apache.org/jira/browse/SPARK-46992
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.2, 3.5.0
> Reporter: Denis Tarima
> Priority: Critical
>
> With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes >