[jira] [Updated] (SPARK-46992) Inconsistent results with 'sort', 'cache', and AQE.

2024-02-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46992:
---
Labels: correctness pull-request-available  (was: correctness)

> Inconsistent results with 'sort', 'cache', and AQE.
> ---
>
> Key: SPARK-46992
> URL: https://issues.apache.org/jira/browse/SPARK-46992
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.2, 3.5.0
> Reporter: Denis Tarima
> Priority: Critical
> Labels: correctness, pull-request-available
>
>  
> With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes 
> {color:#4c9aff}sample{color} results after caching.
> Moreover, when cached,  {color:#4c9aff}collect{color} returns records as if 
> it's not cached, which is inconsistent with {color:#4c9aff}count{color} and 
> {color:#4c9aff}show{color}.
> A script to reproduce:
> {code:scala}
> import spark.implicits._
> val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)
> println("NON CACHED:")
> println("  count: " + df.count())
> println("  collect: " + df.collect().mkString(" "))
> df.show()
> println("CACHED:")
> df.cache().count()
> println("  count: " + df.count())
> println("  collect: " + df.collect().mkString(" "))
> df.show()
> df.unpersist()
> {code}
> output:
> {code:java}
> NON CACHED:
>   count: 2
>   collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  4|
> +---+
> CACHED:
>   count: 3
>   collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
> {code}
> BTW, disabling AQE 
> [{color:#4c9aff}spark.conf.set("spark.databricks.optimizer.adaptive.enabled", 
> "false"){color}] helps on Databricks clusters, but locally it has no effect, 
> at least on Spark 3.3.2.
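
A note on the AQE switch quoted above: the spark.databricks.optimizer.adaptive.enabled key appears to be specific to Databricks, which may be why setting it has no effect on a local build. In open-source Apache Spark the AQE flag is spark.sql.adaptive.enabled. A minimal sketch for a local spark-shell session, assuming `spark` is the active SparkSession. Whether this changes the results of the reproduction script is not established in this thread.

```scala
// Assumes a spark-shell session, where `spark` is the active SparkSession.
// spark.sql.adaptive.enabled is the AQE flag in open-source Apache Spark.
spark.conf.set("spark.sql.adaptive.enabled", "false")

// Read the value back to confirm the session picked it up.
println(spark.conf.get("spark.sql.adaptive.enabled"))
```

The flag is session-scoped, so it would need to be set before running the reproduction script in the same session.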



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46992) Inconsistent results with 'sort', 'cache', and AQE.

2024-02-06 Thread Denis Tarima (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Tarima updated SPARK-46992:
-
Description: 
 
With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes 
{color:#4c9aff}sample{color} results after caching.

Moreover, when cached,  {color:#4c9aff}collect{color} returns records as if 
it's not cached, which is inconsistent with {color:#4c9aff}count{color} and 
{color:#4c9aff}show{color}.

A script to reproduce:
{code:scala}
import spark.implicits._
val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)

println("NON CACHED:")

println("  count: " + df.count())
println("  collect: " + df.collect().mkString(" "))
df.show()

println("CACHED:")
df.cache().count()

println("  count: " + df.count())
println("  collect: " + df.collect().mkString(" "))
df.show()

df.unpersist()
{code}
output:
{code:java}
NON CACHED:
  count: 2
  collect: [1] [4]
+---+
| id|
+---+
|  1|
|  4|
+---+

CACHED:
  count: 3
  collect: [1] [4]
+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+
{code}
BTW, disabling AQE 
[{color:#4c9aff}spark.conf.set("spark.databricks.optimizer.adaptive.enabled", 
"false"){color}] helps on Databricks clusters, but locally it has no effect, at 
least on Spark 3.3.2.

  was:
 
With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes 
{color:#4c9aff}sample{color} results after caching.

Moreover, when cached,  {color:#4c9aff}collect{color} returns records as if 
it's not cached, which is inconsistent with {color:#4c9aff}count{color} and 
{color:#4c9aff}show{color}.

A script to reproduce:
{code:scala}
import spark.implicits._
val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)

println("NON CACHED:")

println("  count: " + df.count())
println("  collect: " + df.collect().mkString(" "))
df.show()

println("CACHED:")
df.cache().count()

println("  count: " + df.count())
println("  collect: " + df.collect().mkString(" "))
df.show()

df.unpersist()
{code}

output:
{code}
NON CACHED:
  count: 2
  collect: [1] [4]
+---+
| id|
+---+
|  1|
|  4|
+---+

CACHED:
  count: 3
  collect: [1] [4]
+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+
{code}


> Inconsistent results with 'sort', 'cache', and AQE.
> ---
>
> Key: SPARK-46992
> URL: https://issues.apache.org/jira/browse/SPARK-46992
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.2, 3.5.0
> Reporter: Denis Tarima
> Priority: Critical
> Labels: correctness
>
>  
> With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes 
> {color:#4c9aff}sample{color} results after caching.
> Moreover, when cached,  {color:#4c9aff}collect{color} returns records as if 
> it's not cached, which is inconsistent with {color:#4c9aff}count{color} and 
> {color:#4c9aff}show{color}.
> A script to reproduce:
> {code:scala}
> import spark.implicits._
> val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)
> println("NON CACHED:")
> println("  count: " + df.count())
> println("  collect: " + df.collect().mkString(" "))
> df.show()
> println("CACHED:")
> df.cache().count()
> println("  count: " + df.count())
> println("  collect: " + df.collect().mkString(" "))
> df.show()
> df.unpersist()
> {code}
> output:
> {code:java}
> NON CACHED:
>   count: 2
>   collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  4|
> +---+
> CACHED:
>   count: 3
>   collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
> {code}
> BTW, disabling AQE 
> [{color:#4c9aff}spark.conf.set("spark.databricks.optimizer.adaptive.enabled", 
> "false"){color}] helps on Databricks clusters, but locally it has no effect, 
> at least on Spark 3.3.2.



[jira] [Updated] (SPARK-46992) Inconsistent results with 'sort', 'cache', and AQE.

2024-02-06 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-46992:
-
Labels: correctness  (was: )

> Inconsistent results with 'sort', 'cache', and AQE.
> ---
>
> Key: SPARK-46992
> URL: https://issues.apache.org/jira/browse/SPARK-46992
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.2, 3.5.0
> Reporter: Denis Tarima
> Priority: Critical
> Labels: correctness
>
>  
> With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes 
> {color:#4c9aff}sample{color} results after caching.
> Moreover, when cached,  {color:#4c9aff}collect{color} returns records as if 
> it's not cached, which is inconsistent with {color:#4c9aff}count{color} and 
> {color:#4c9aff}show{color}.
> A script to reproduce:
> {code:scala}
> import spark.implicits._
> val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)
> println("NON CACHED:")
> println("  count: " + df.count())
> println("  collect: " + df.collect().mkString(" "))
> df.show()
> println("CACHED:")
> df.cache().count()
> println("  count: " + df.count())
> println("  collect: " + df.collect().mkString(" "))
> df.show()
> df.unpersist()
> {code}
> output:
> {code}
> NON CACHED:
>   count: 2
>   collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  4|
> +---+
> CACHED:
>   count: 3
>   collect: [1] [4]
> +---+
> | id|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
> {code}
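
The mismatch in the output above can be stated mechanically. The sketch below is a hypothetical helper (the names Observation and findMismatches are illustrative and not part of Spark) that compares what count and collect returned before and after caching. It is plain Scala meant for a REPL or spark-shell, fed with the values copied from the reported output.

```scala
// Hypothetical helper, not part of Spark: records what each action returned.
final case class Observation(label: String, count: Long, collected: Seq[String])

// Returns one human-readable message per mismatch found.
def findMismatches(before: Observation, after: Observation): Seq[String] = {
  // Within one observation, count() and collect() should agree on the row count.
  val internal = Seq(before, after).collect {
    case o if o.count != o.collected.size =>
      s"${o.label}: count=${o.count} but collect returned ${o.collected.size} rows"
  }
  // Caching should not change the number of rows a DataFrame produces.
  val acrossCount =
    if (before.count != after.count)
      Seq(s"count changed: ${before.count} -> ${after.count}")
    else Seq.empty[String]
  // Caching should not change the rows themselves either.
  val acrossRows =
    if (before.collected != after.collected)
      Seq("collect changed between observations")
    else Seq.empty[String]
  internal ++ acrossCount ++ acrossRows
}

// Values copied from the output in the report.
val nonCached = Observation("NON CACHED", 2L, Seq("[1]", "[4]"))
val cached = Observation("CACHED", 3L, Seq("[1]", "[4]"))

findMismatches(nonCached, cached).foreach(println)
// CACHED: count=3 but collect returned 2 rows
// count changed: 2 -> 3
```

In spark-shell, an observation could be built from the DataFrame in the script with Observation("CACHED", df.count(), df.collect().map(_.toString).toSeq).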



[jira] [Updated] (SPARK-46992) Inconsistent results with 'sort', 'cache', and AQE.

2024-02-06 Thread Denis Tarima (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Tarima updated SPARK-46992:
-
Description: 
 
With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes 
{color:#4c9aff}sample{color} results after caching.

Moreover, when cached,  {color:#4c9aff}collect{color} returns records as if 
it's not cached, which is inconsistent with {color:#4c9aff}count{color} and 
{color:#4c9aff}show{color}.

A script to reproduce:
{code:scala}
import spark.implicits._
val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)

println("NON CACHED:")

println("  count: " + df.count())
println("  collect: " + df.collect().mkString(" "))
df.show()

println("CACHED:")
df.cache().count()

println("  count: " + df.count())
println("  collect: " + df.collect().mkString(" "))
df.show()

df.unpersist()
{code}

output:
{code}
NON CACHED:
  count: 2
  collect: [1] [4]
+---+
| id|
+---+
|  1|
|  4|
+---+

CACHED:
  count: 3
  collect: [1] [4]
+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+
{code}

  was:
 
With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes 
{color:#4c9aff}sample{color} results after caching.

Moreover, when cached,  {color:#4c9aff}collect{color} returns records as if 
it's not cached, which is inconsistent with {color:#4c9aff}count{color} and 
{color:#4c9aff}show{color}.

A script to reproduce:
import spark.implicits._
val df = (1 to 4).toDF("id").sort("id").sample(0.4, 123)

println("NON CACHED:")

println("  count: " + df.count())
println("  collect: " + df.collect().mkString(" "))
df.show()

println("CACHED:")
df.cache().count()

println("  count: " + df.count())
println("  collect: " + df.collect().mkString(" "))
df.show()

df.unpersist()
output:
NON CACHED:
  count: 2
  collect: [1] [4]
+---+
| id|
+---+
|  1|
|  4|
+---+

CACHED:
  count: 3
  collect: [1] [4]
+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+


> Inconsistent results with 'sort', 'cache', and AQE.
> ---
>
> Key: SPARK-46992
> URL: https://issues.apache.org/jira/browse/SPARK-46992
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.3.2, 3.5.0
> Reporter: Denis Tarima
> Priority: Critical
>
>  
> With AQE enabled, having {color:#4c9aff}sort{color} in the plan changes 
>