[jira] [Commented] (SPARK-46295) TPCDS q39a and q39b have correctness issues with broadcast hash join and shuffled hash join

2023-12-07 Thread Kazuyuki Tanimura (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17794081#comment-17794081
 ] 

Kazuyuki Tanimura commented on SPARK-46295:
---

I realized that I am using the default {{numPartitions=100}}, while the CI 
(GHA) is using {{numPartitions=1}}.

So it looks like this issue happens when {{numPartitions != 1}}.

> TPCDS q39a and q39b have correctness issues with broadcast hash join and 
> shuffled hash join
> ---
>
> Key: SPARK-46295
> URL: https://issues.apache.org/jira/browse/SPARK-46295
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.5.0, 4.0.0
>Reporter: Kazuyuki Tanimura
>Priority: Major
>  Labels: correctness
>
> {{TPCDSQueryTestSuite}} fails for q39a and a39b with 
> {{broadcastHashJoinConf}} and {{shuffledHashJoinConf}}. It works fine with 
> {{sortMergeJoinConf}}
> {code}SPARK_TPCDS_DATA= build/sbt "~sql/testOnly 
> *TPCDSQueryTestSuite -- -z q39a"{code}
> {code}
> [info] - q39a *** FAILED *** (19 seconds, 139 milliseconds)
> [info]   java.lang.Exception: Expected "...25 1.022382911080458[8 ..." but 
> got "...25 1.022382911080458[5 ..."
> {code}
> {code}SPARK_TPCDS_DATA= build/sbt "~sql/testOnly 
> *TPCDSQueryTestSuite -- -z q39b"{code}
> {code}
> [info] - q39b *** FAILED *** (19 seconds, 351 milliseconds)
> [info]   java.lang.Exception: Expected "...34 1.563403519178623[3 3   
> 10427   2   381.25  1.0623056061004696
> [info] 3  33151   271.75  1.555976998814345   3   3315
> 2   393.75  1.0196319345405949
> [info] 3  33931   260.0   1.5009563026568116  3   3393
> 2   470.25  1.129275872154205
> [info] 4  16211   1   257.7   1.6381074811154002] 
> 4   16211   2   352.25  1", but got "...34  1.563403519178623[5   
>   3   10427   2   381.25  1.0623056061004696
> [info] 3  33151   271.75  1.555976998814345   3   3315
> 2   393.75  1.0196319345405949
> [info] 3  33931   260.0   1.5009563026568118  3   3393
> 2   470.25  1.129275872154205
> [info] 4  16211   1   257.7   1.6381074811154]
> 4   16211   2   352.25  1" Result did not match
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46295) TPCDS q39a and q39b have correctness issues with broadcast hash join and shuffled hash join

2023-12-06 Thread Kazuyuki Tanimura (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuyuki Tanimura updated SPARK-46295:
--
Affects Version/s: 3.4.2
   (was: 3.4.1)

> TPCDS q39a and q39b have correctness issues with broadcast hash join and 
> shuffled hash join
> ---
>
> Key: SPARK-46295
> URL: https://issues.apache.org/jira/browse/SPARK-46295
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.2, 3.5.0, 4.0.0
>Reporter: Kazuyuki Tanimura
>Priority: Major
>  Labels: correctness
>
> {{TPCDSQueryTestSuite}} fails for q39a and a39b with 
> {{broadcastHashJoinConf}} and {{shuffledHashJoinConf}}. It works fine with 
> {{sortMergeJoinConf}}
> {code}SPARK_TPCDS_DATA= build/sbt "~sql/testOnly 
> *TPCDSQueryTestSuite -- -z q39a"{code}
> {code}
> [info] - q39a *** FAILED *** (19 seconds, 139 milliseconds)
> [info]   java.lang.Exception: Expected "...25 1.022382911080458[8 ..." but 
> got "...25 1.022382911080458[5 ..."
> {code}
> {code}SPARK_TPCDS_DATA= build/sbt "~sql/testOnly 
> *TPCDSQueryTestSuite -- -z q39b"{code}
> {code}
> [info] - q39b *** FAILED *** (19 seconds, 351 milliseconds)
> [info]   java.lang.Exception: Expected "...34 1.563403519178623[3 3   
> 10427   2   381.25  1.0623056061004696
> [info] 3  33151   271.75  1.555976998814345   3   3315
> 2   393.75  1.0196319345405949
> [info] 3  33931   260.0   1.5009563026568116  3   3393
> 2   470.25  1.129275872154205
> [info] 4  16211   1   257.7   1.6381074811154002] 
> 4   16211   2   352.25  1", but got "...34  1.563403519178623[5   
>   3   10427   2   381.25  1.0623056061004696
> [info] 3  33151   271.75  1.555976998814345   3   3315
> 2   393.75  1.0196319345405949
> [info] 3  33931   260.0   1.5009563026568118  3   3393
> 2   470.25  1.129275872154205
> [info] 4  16211   1   257.7   1.6381074811154]
> 4   16211   2   352.25  1" Result did not match
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46295) TPCDS q39a and q39b have correctness issues with broadcast hash join and shuffled hash join

2023-12-06 Thread Kazuyuki Tanimura (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuyuki Tanimura updated SPARK-46295:
--
Labels: correctness  (was: )

> TPCDS q39a and q39b have correctness issues with broadcast hash join and 
> shuffled hash join
> ---
>
> Key: SPARK-46295
> URL: https://issues.apache.org/jira/browse/SPARK-46295
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0, 4.0.0
>Reporter: Kazuyuki Tanimura
>Priority: Major
>  Labels: correctness
>
> {{TPCDSQueryTestSuite}} fails for q39a and a39b with 
> {{broadcastHashJoinConf}} and {{shuffledHashJoinConf}}. It works fine with 
> {{sortMergeJoinConf}}
> {code}SPARK_TPCDS_DATA= build/sbt "~sql/testOnly 
> *TPCDSQueryTestSuite -- -z q39a"{code}
> {code}
> [info] - q39a *** FAILED *** (19 seconds, 139 milliseconds)
> [info]   java.lang.Exception: Expected "...25 1.022382911080458[8 ..." but 
> got "...25 1.022382911080458[5 ..."
> {code}
> {code}SPARK_TPCDS_DATA= build/sbt "~sql/testOnly 
> *TPCDSQueryTestSuite -- -z q39b"{code}
> {code}
> [info] - q39b *** FAILED *** (19 seconds, 351 milliseconds)
> [info]   java.lang.Exception: Expected "...34 1.563403519178623[3 3   
> 10427   2   381.25  1.0623056061004696
> [info] 3  33151   271.75  1.555976998814345   3   3315
> 2   393.75  1.0196319345405949
> [info] 3  33931   260.0   1.5009563026568116  3   3393
> 2   470.25  1.129275872154205
> [info] 4  16211   1   257.7   1.6381074811154002] 
> 4   16211   2   352.25  1", but got "...34  1.563403519178623[5   
>   3   10427   2   381.25  1.0623056061004696
> [info] 3  33151   271.75  1.555976998814345   3   3315
> 2   393.75  1.0196319345405949
> [info] 3  33931   260.0   1.5009563026568118  3   3393
> 2   470.25  1.129275872154205
> [info] 4  16211   1   257.7   1.6381074811154]
> 4   16211   2   352.25  1" Result did not match
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-46295) TPCDS q39a and q39b have correctness issues with broadcast hash join and shuffled hash join

2023-12-06 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-46295:
-

 Summary: TPCDS q39a and q39b have correctness issues with 
broadcast hash join and shuffled hash join
 Key: SPARK-46295
 URL: https://issues.apache.org/jira/browse/SPARK-46295
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0, 3.4.1, 4.0.0
Reporter: Kazuyuki Tanimura


{{TPCDSQueryTestSuite}} fails for q39a and a39b with {{broadcastHashJoinConf}} 
and {{shuffledHashJoinConf}}. It works fine with {{sortMergeJoinConf}}

{code}SPARK_TPCDS_DATA= build/sbt "~sql/testOnly 
*TPCDSQueryTestSuite -- -z q39a"{code}
{code}
[info] - q39a *** FAILED *** (19 seconds, 139 milliseconds)
[info]   java.lang.Exception: Expected "...25   1.022382911080458[8 ..." but 
got "...25 1.022382911080458[5 ..."
{code}

{code}SPARK_TPCDS_DATA= build/sbt "~sql/testOnly 
*TPCDSQueryTestSuite -- -z q39b"{code}

{code}
[info] - q39b *** FAILED *** (19 seconds, 351 milliseconds)
[info]   java.lang.Exception: Expected "...34   1.563403519178623[3 3   
10427   2   381.25  1.0623056061004696
[info] 333151   271.75  1.555976998814345   3   3315
2   393.75  1.0196319345405949
[info] 333931   260.0   1.5009563026568116  3   3393
2   470.25  1.129275872154205
[info] 416211   1   257.7   1.6381074811154002] 
4   16211   2   352.25  1", but got "...34  1.563403519178623[5 
3   10427   2   381.25  1.0623056061004696
[info] 333151   271.75  1.555976998814345   3   3315
2   393.75  1.0196319345405949
[info] 333931   260.0   1.5009563026568118  3   3393
2   470.25  1.129275872154205
[info] 416211   1   257.7   1.6381074811154]
4   16211   2   352.25  1" Result did not match
{code}
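
For context, a rough sketch of the SQLConf settings that the three join configurations named above typically correspond to; the exact values used by {{TPCDSQueryTestSuite}} may differ, so treat these as assumptions:

{code:scala}
// Hedged sketch: approximate confs behind the three join modes (not necessarily
// the exact values in TPCDSQueryTestSuite). `spark` is a SparkSession (spark-shell).
val sortMergeJoinConf = Map(
  "spark.sql.autoBroadcastJoinThreshold" -> "-1",
  "spark.sql.join.preferSortMergeJoin"   -> "true")
val broadcastHashJoinConf = Map(
  "spark.sql.autoBroadcastJoinThreshold" -> "10485760")  // 10 MB
val shuffledHashJoinConf = Map(
  "spark.sql.autoBroadcastJoinThreshold" -> "-1",
  "spark.sql.join.preferSortMergeJoin"   -> "false")

// Apply one of them before running a query under test:
shuffledHashJoinConf.foreach { case (k, v) => spark.conf.set(k, v) }
{code}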



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45786) Inaccurate Decimal multiplication and division results

2023-11-03 Thread Kazuyuki Tanimura (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuyuki Tanimura updated SPARK-45786:
--
Affects Version/s: 4.0.0

> Inaccurate Decimal multiplication and division results
> --
>
> Key: SPARK-45786
> URL: https://issues.apache.org/jira/browse/SPARK-45786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.4, 3.3.3, 3.4.1, 3.5.0, 4.0.0
>Reporter: Kazuyuki Tanimura
>Priority: Major
>
> Decimal multiplication and division results may be inaccurate due to rounding 
> issues.
> h2. Multiplication:
> {code:scala}
> scala> sql("select  -14120025096157587712113961295153.858047 * 
> -0.4652").show(truncate=false)
> +----------------------------------------------------+
> |(-14120025096157587712113961295153.858047 * -0.4652)|
> +----------------------------------------------------+
> |6568635674732509803675414794505.574764              |
> +----------------------------------------------------+
> {code}
> The correct answer is
> {quote}6568635674732509803675414794505.574763
> {quote}
> Please note that the last digit is 3 instead of 4 as
>  
> {code:scala}
> scala> 
> java.math.BigDecimal("-14120025096157587712113961295153.858047").multiply(java.math.BigDecimal("-0.4652"))
> val res21: java.math.BigDecimal = 6568635674732509803675414794505.5747634644
> {code}
> Since the fractional part .574763 is followed by 4644, it should not be 
> rounded up.
> h2. Division:
> {code:scala}
> scala> sql("select -0.172787979 / 
> 533704665545018957788294905796.5").show(truncate=false)
> +-------------------------------------------------+
> |(-0.172787979 / 533704665545018957788294905796.5)|
> +-------------------------------------------------+
> |-3.237521E-31                                    |
> +-------------------------------------------------+
> {code}
> The correct answer is
> {quote}-3.237520E-31
> {quote}
> Please note that the last digit is 0 instead of 1 as
>  
> {code:scala}
> scala> 
> java.math.BigDecimal("-0.172787979").divide(java.math.BigDecimal("533704665545018957788294905796.5"),
>  100, java.math.RoundingMode.DOWN)
> val res22: java.math.BigDecimal = 
> -3.237520489418037889998826491401059986665344697406144511563561222578738E-31
> {code}
> Since the fractional part .237520 is followed by 4894..., it should not be 
> rounded up.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45786) Inaccurate Decimal multiplication and division results

2023-11-03 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-45786:
-

 Summary: Inaccurate Decimal multiplication and division results
 Key: SPARK-45786
 URL: https://issues.apache.org/jira/browse/SPARK-45786
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0, 3.4.1, 3.3.3, 3.2.4
Reporter: Kazuyuki Tanimura


Decimal multiplication and division results may be inaccurate due to rounding 
issues.
h2. Multiplication:
{code:scala}
scala> sql("select  -14120025096157587712113961295153.858047 * 
-0.4652").show(truncate=false)
+----------------------------------------------------+
|(-14120025096157587712113961295153.858047 * -0.4652)|
+----------------------------------------------------+
|6568635674732509803675414794505.574764              |
+----------------------------------------------------+
{code}
The correct answer is
{quote}6568635674732509803675414794505.574763
{quote}

Please note that the last digit is 3 instead of 4 as

 
{code:scala}
scala> 
java.math.BigDecimal("-14120025096157587712113961295153.858047").multiply(java.math.BigDecimal("-0.4652"))
val res21: java.math.BigDecimal = 6568635674732509803675414794505.5747634644
{code}
Since the fractional part .574763 is followed by 4644, it should not be rounded 
up.
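
A minimal check with plain {{java.math}} (no Spark involved), building on the BigDecimal product above: rounding the exact product to the result scale shown in the output (6) must keep the last digit 3.

{code:scala}
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Exact product, then round to scale 6 (the scale of the result column above).
val exact = new JBigDecimal("-14120025096157587712113961295153.858047")
  .multiply(new JBigDecimal("-0.4652"))
println(exact)                                    // 6568635674732509803675414794505.5747634644
println(exact.setScale(6, RoundingMode.HALF_UP))  // 6568635674732509803675414794505.574763
{code}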
h2. Division:
{code:scala}
scala> sql("select -0.172787979 / 
533704665545018957788294905796.5").show(truncate=false)
+-------------------------------------------------+
|(-0.172787979 / 533704665545018957788294905796.5)|
+-------------------------------------------------+
|-3.237521E-31                                    |
+-------------------------------------------------+
{code}
The correct answer is
{quote}-3.237520E-31
{quote}

Please note that the last digit is 0 instead of 1 as

 
{code:scala}
scala> 
java.math.BigDecimal("-0.172787979").divide(java.math.BigDecimal("533704665545018957788294905796.5"),
 100, java.math.RoundingMode.DOWN)
val res22: java.math.BigDecimal = 
-3.237520489418037889998826491401059986665344697406144511563561222578738E-31
{code}
Since the fractional part .237520 is followed by 4894..., it should not be 
rounded up.
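
The same style of check for the division case; assuming the displayed result scale of 37 (i.e. 7 significant digits at E-31), rounding the exact quotient must keep the trailing 0.

{code:scala}
import java.math.{BigDecimal => JBigDecimal, MathContext, RoundingMode}

// Exact-enough quotient (DECIMAL128 precision), then round to scale 37.
val quotient = new JBigDecimal("-0.172787979")
  .divide(new JBigDecimal("533704665545018957788294905796.5"), MathContext.DECIMAL128)
println(quotient.setScale(37, RoundingMode.HALF_UP))  // -3.237520E-31
{code}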



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42833) Refactor `applyExtensions` in `SparkSession`

2023-03-17 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-42833:
-

 Summary: Refactor `applyExtensions` in `SparkSession`
 Key: SPARK-42833
 URL: https://issues.apache.org/jira/browse/SPARK-42833
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Kazuyuki Tanimura


Refactor `applyExtensions` in `SparkSession` in order to reduce duplicated 
code.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42256) SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-01-31 Thread Kazuyuki Tanimura (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuyuki Tanimura updated SPARK-42256:
--
Description: 
Spark-SQL filter operation is a common workload in order to select specific 
rows from persisted data. The current implementation of Spark requires the read 
values to materialize (i.e. de-compress, de-code, etc...) onto memory first 
before applying the filters. This approach means that the filters may 
eventually throw away many values, resulting in wasted computations. 
Alternatively, evaluating the filters first and lazily materializing only the 
used values can save waste and improve the read performance. Lazy 
materialization has been employed by other distributed SQL engines such as 
Velox and Presto/Trino, but this approach has not yet been extended to Spark 
with Parquet.

SPIP: 
https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME

  was:
Spark-SQL filter operation is a common workload in order to select specific 
rows from persisted data. The current implementation of Spark requires the read 
values to materialize (i.e. de-compress, de-code, etc...) onto memory first 
before applying the filters. This approach means that the filters may 
eventually throw away many values, resulting in wasted computations. 
Alternatively, evaluating the filters first and lazily materializing only the 
used values can save waste and improve the read performance. Lazy 
materialization has been employed by other distributed SQL engines such as 
Velox and Presto/Trino, but this approach has not yet been extended to Spark 
with Parquet.

SPIP: google doc


> SPIP: Lazy Materialization for Parquet Read Performance Improvement
> ---
>
> Key: SPARK-42256
> URL: https://issues.apache.org/jira/browse/SPARK-42256
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Kazuyuki Tanimura
>Priority: Major
>  Labels: SPIP
>
> Spark-SQL filter operation is a common workload in order to select 
> specific rows from persisted data. The current implementation of Spark 
> requires the read values to materialize (i.e. de-compress, de-code, etc...) 
> onto memory first before applying the filters. This approach means that the 
> filters may eventually throw away many values, resulting in wasted 
> computations. Alternatively, evaluating the filters first and lazily 
> materializing only the used values can save waste and improve the read 
> performance. Lazy materialization has been employed by other distributed SQL 
> engines such as Velox and Presto/Trino, but this approach has not yet been 
> extended to Spark with Parquet.
> SPIP: 
> https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42256) SPIP: Lazy Materialization for Parquet Read Performance Improvement

2023-01-31 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-42256:
-

 Summary: SPIP: Lazy Materialization for Parquet Read Performance 
Improvement
 Key: SPARK-42256
 URL: https://issues.apache.org/jira/browse/SPARK-42256
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.5.0
Reporter: Kazuyuki Tanimura


Spark-SQL filter operation is a common workload in order to select specific 
rows from persisted data. The current implementation of Spark requires the read 
values to materialize (i.e. de-compress, de-code, etc...) onto memory first 
before applying the filters. This approach means that the filters may 
eventually throw away many values, resulting in wasted computations. 
Alternatively, evaluating the filters first and lazily materializing only the 
used values can save waste and improve the read performance. Lazy 
materialization has been employed by other distributed SQL engines such as 
Velox and Presto/Trino, but this approach has not yet been extended to Spark 
with Parquet.

SPIP: google doc
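
A self-contained, purely conceptual sketch of the idea (not Spark or Parquet internals; all names below are hypothetical): evaluate the filter on one column first and materialize the other column only at the row positions that survive.

{code:scala}
// Conceptual sketch of lazy materialization; not Spark internals.
case class EncodedColumn(encoded: Array[Int], dict: Array[String]) {
  def decode(i: Int): String = dict(encoded(i))  // "materialize" a single value
}

def lazyScan(filterCol: EncodedColumn, projectedCol: EncodedColumn,
             pred: String => Boolean): Seq[String] = {
  // 1. Evaluate the filter first (only the filter column is decoded).
  val passing = filterCol.encoded.indices.filter(i => pred(filterCol.decode(i)))
  // 2. Materialize the projected column only for the rows that survived.
  passing.map(projectedCol.decode)
}
{code}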



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41096) Support reading parquet FIXED_LEN_BYTE_ARRAY type

2022-11-10 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-41096:
-

 Summary: Support reading parquet FIXED_LEN_BYTE_ARRAY type
 Key: SPARK-41096
 URL: https://issues.apache.org/jira/browse/SPARK-41096
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Kazuyuki Tanimura


Parquet has a FIXED_LEN_BYTE_ARRAY (FLBA) data type. However, the Spark Parquet 
reader currently cannot handle it. Read it as BinaryType in Spark.

The Iceberg Parquet reader, for example, can handle FLBA. This improvement should 
reduce the gap between the Spark and Iceberg Parquet readers.
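
A hedged sketch of the user-facing effect once this lands (the file path and column name are placeholders): a FIXED_LEN_BYTE_ARRAY column would simply show up as {{binary}} and be readable like any other column.

{code:scala}
// Assumes a spark-shell style `spark` session and a parquet file containing an
// FLBA column; names are illustrative only.
import org.apache.spark.sql.functions.hex

val df = spark.read.parquet("/path/to/file_with_flba_column.parquet")
df.printSchema()                        // the FLBA column should appear as binary
df.select(hex(df("flba_col"))).show()   // inspect the fixed-length bytes
{code}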



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40477) Support `NullType` in `ColumnarBatchRow`

2022-09-20 Thread Kazuyuki Tanimura (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuyuki Tanimura resolved SPARK-40477.
---
Resolution: Won't Fix

I gave it another thought and decided to close this one as Won't Fix. There is no 
natural code path that calls ColumnarBatchRow.get() for NullType columns; in 
particular, NullType cannot be stored as a partition column in a columnar format 
like Parquet.

> Support `NullType` in `ColumnarBatchRow`
> 
>
> Key: SPARK-40477
> URL: https://issues.apache.org/jira/browse/SPARK-40477
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> `ColumnarBatchRow.get()` does not support `NullType` currently. Support 
> `NullType` in `ColumnarBatchRow` so that `NullType` can be partition column 
> type.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40477) Support `NullType` in `ColumnarBatchRow`

2022-09-16 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-40477:
-

 Summary: Support `NullType` in `ColumnarBatchRow`
 Key: SPARK-40477
 URL: https://issues.apache.org/jira/browse/SPARK-40477
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Kazuyuki Tanimura


`ColumnarBatchRow.get()` does not support `NullType` currently. Support 
`NullType` in `ColumnarBatchRow` so that `NullType` can be partition column 
type.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40195) Add PrunedScanWithAQESuite

2022-08-24 Thread Kazuyuki Tanimura (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuyuki Tanimura resolved SPARK-40195.
---
Resolution: Invalid

I just realized the suite is not for AQE, so closing

> Add PrunedScanWithAQESuite
> --
>
> Key: SPARK-40195
> URL: https://issues.apache.org/jira/browse/SPARK-40195
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> Currently `PrunedScanSuite` assumes that AQE is never applied. We should 
> also test with AQE force-applied.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40195) Add PrunedScanWithAQESuite

2022-08-23 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-40195:
-

 Summary: Add PrunedScanWithAQESuite
 Key: SPARK-40195
 URL: https://issues.apache.org/jira/browse/SPARK-40195
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.4.0
Reporter: Kazuyuki Tanimura


Currently `PrunedScanSuite` assumes that AQE is never applied. We should 
also test with AQE force-applied.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40110) Add JDBCWithAQESuite

2022-08-16 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-40110:
-

 Summary: Add JDBCWithAQESuite
 Key: SPARK-40110
 URL: https://issues.apache.org/jira/browse/SPARK-40110
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.4.0
Reporter: Kazuyuki Tanimura


Currently `JDBCSuite` assumes that AQE is always turned off. We should also 
test with AQE turned on



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40088) Add SparkPlanWIthAQESuite

2022-08-15 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-40088:
-

 Summary: Add SparkPlanWIthAQESuite
 Key: SPARK-40088
 URL: https://issues.apache.org/jira/browse/SPARK-40088
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.4.0
Reporter: Kazuyuki Tanimura


Currently `SparkPlanSuite` assumes that AQE is always turned off. We should 
also test with AQE turned on



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40049) Add adaptive plan case in ReplaceNullWithFalseInPredicateEndToEndSuite

2022-08-11 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-40049:
-

 Summary: Add adaptive plan case in 
ReplaceNullWithFalseInPredicateEndToEndSuite
 Key: SPARK-40049
 URL: https://issues.apache.org/jira/browse/SPARK-40049
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.4.0
Reporter: Kazuyuki Tanimura


Currently `ReplaceNullWithFalseInPredicateEndToEndSuite` assumes that adaptive 
query execution is turned off. We should also add cases with 
`spark.sql.adaptive.forceApply=true`, as sketched below.
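
A minimal sketch, assuming the suite mixes in the usual test helpers ({{SharedSparkSession}}/{{SQLTestUtils}}, which provide {{withSQLConf}}), of re-running the existing assertions with AQE force-applied:

{code:scala}
// Sketch only: wrap the existing test body in withSQLConf so AQE is force-applied.
import org.apache.spark.sql.internal.SQLConf

withSQLConf(
    SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true",
    SQLConf.ADAPTIVE_EXECUTION_FORCE_APPLY.key -> "true") {
  // ... existing ReplaceNullWithFalseInPredicateEndToEndSuite assertions ...
}
{code}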




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39584) Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results

2022-06-28 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura updated an issue:

Spark / SPARK-39584
Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results

Change By: Kazuyuki Tanimura

Description:
GenTPCDSData uses the schema defined in `TPCDSSchema` that contains 
varchar(N)/char(N). When GenTPCDSData generates parquet, that pads spaces for 
strings whose lengths are < N.

When TPCDSQueryBenchmark reads data from parquet generated by GenTPCDSData, it 
uses schema from the parquet file and keeps the paddings. Due to the extra 
spaces, string filter queries of TPC-DS fail to match. For example, q13 query 
results are all nulls and returns too fast because string filter does not meet 
any rows.

Therefore, TPCDSQueryBenchmark is benchmarking with wrong query results and 
that is inflating some performance results.

I am exploring two possible solutions now
1. Call `CREATE TABLE tableName schema USING parquet LOCATION path` before 
reading. This is what Spark TPC-DS unit tests are doing
2. Change varchar/char to string in the schema. This is what [databricks data 
generator|https://github.com/databricks/spark-sql-perf] is doing

TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in 
https://issues.apache.org/jira/browse/SPARK-35192

History related varchar/char issue 
[https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn]


--
This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)

[jira] [Commented] (SPARK-39584) Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results

2022-06-28 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura commented on SPARK-39584:

Re: Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results

Benchmark results of the current code: 
https://github.com/apache/spark/pull/37020/files

After applying the fixes, the running time is expected to grow, as the current 
queries are getting null results and returning too fast.


--
This message was sent by Atlassian Jira (v8.20.10#820010-sha1:ace47f9)

[jira] [Updated] (SPARK-39584) Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results

2022-06-24 Thread Kazuyuki Tanimura (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuyuki Tanimura updated SPARK-39584:
--
Description: 
GenTPCDSData uses the schema defined in `TPCDSSchema` that contains 
varchar(N)/char(N). When GenTPCDSData generates parquet, that pads spaces for 
strings whose lengths are < N.

When TPCDSQueryBenchmark reads data from parquet generated by GenTPCDSData, it 
uses schema from the parquet file and keeps the paddings. Due to the extra 
spaces, string filter queries of TPC-DS fail to match. For example, q13 query 
results are all nulls and returns too fast because string filter does not meet 
any rows.

Therefore, TPCDSQueryBenchmark is benchmarking with wrong query results and 
that is inflating some performance results.

I am exploring two possible solutions now
1. Call `CREATE TABLE tableName schema USING parquet LOCATION path` before 
reading. This is what Spark TPC-DS unit tests are doing
2. Change varchar to string in the schema. This is what [databricks data 
generator|https://github.com/databricks/spark-sql-perf] is doing

TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in 
https://issues.apache.org/jira/browse/SPARK-35192

History related varchar issue 
[https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn]

  was:
GenTPCDSData uses the schema defined in `TPCDSSchema` that contains 
varchar(N)/char(N). When GenTPCDSData generates parquet, that pads spaces for 
strings whose lengths are < N.

When TPCDSQueryBenchmark reads data from parquet generated by GenTPCDSData, it 
uses schema from the parquet file and keeps the paddings. Due to the extra 
spaces, string filter queries of TPC-DS fail to match. For example, q13 query 
results are all nulls and returns too fast because string filter does not meet 
any rows.

Therefore, TPCDSQueryBenchmark is benchmarking with wrong query results and 
that is inflating some performance results.

I am exploring two possible solutions now
1. Call `CREATE TABLE tableName schema USING parquet LOCATION path` before 
reading. This is what Spark unit tests are doing
2. Change varchar to string in the schema. This is what [databricks data 
generator|https://github.com/databricks/spark-sql-perf] is doing

TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in 
https://issues.apache.org/jira/browse/SPARK-35192

History related varchar issue 
[https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn]


> Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
> 
>
> Key: SPARK-39584
> URL: https://issues.apache.org/jira/browse/SPARK-39584
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.1, 3.3.0, 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> GenTPCDSData uses the schema defined in `TPCDSSchema` that contains 
> varchar(N)/char(N). When GenTPCDSData generates parquet, that pads spaces for 
> strings whose lengths are < N.
> When TPCDSQueryBenchmark reads data from parquet generated by GenTPCDSData, 
> it uses schema from the parquet file and keeps the paddings. Due to the extra 
> spaces, string filter queries of TPC-DS fail to match. For example, q13 query 
> results are all nulls and returns too fast because string filter does not 
> meet any rows.
> Therefore, TPCDSQueryBenchmark is benchmarking with wrong query results and 
> that is inflating some performance results.
> I am exploring two possible solutions now
> 1. Call `CREATE TABLE tableName schema USING parquet LOCATION path` before 
> reading. This is what Spark TPC-DS unit tests are doing
> 2. Change varchar to string in the schema. This is what [databricks data 
> generator|https://github.com/databricks/spark-sql-perf] is doing
> TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in 
> https://issues.apache.org/jira/browse/SPARK-35192
> History related varchar issue 
> [https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39584) Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results

2022-06-24 Thread Kazuyuki Tanimura (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuyuki Tanimura updated SPARK-39584:
--
Description: 
GenTPCDSData uses the schema defined in `TPCDSSchema` that contains 
varchar(N)/char(N). When GenTPCDSData generates parquet, that pads spaces for 
strings whose lengths are < N.

When TPCDSQueryBenchmark reads data from parquet generated by GenTPCDSData, it 
uses schema from the parquet file and keeps the paddings. Due to the extra 
spaces, string filter queries of TPC-DS fail to match. For example, q13 query 
results are all nulls and returns too fast because string filter does not meet 
any rows.

Therefore, TPCDSQueryBenchmark is benchmarking with wrong query results and 
that is inflating some performance results.

I am exploring two possible solutions now
1. Call `CREATE TABLE tableName schema USING parquet LOCATION path` before 
reading. This is what Spark unit tests are doing
2. Change varchar to string in the schema. This is what [databricks data 
generator|https://github.com/databricks/spark-sql-perf] is doing

TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in 
https://issues.apache.org/jira/browse/SPARK-35192

History related varchar issue 
[https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn]

  was:
GenTPCDSData uses the schema defined in `TPCDSSchema` that contains 
varchar(N)/char(N). When GenTPCDSData generates parquet, that pads spaces for 
strings whose lengths are < N.

When TPCDSQueryBenchmark reads data from parquet generated by GenTPCDSData, it 
uses schema from the parquet file and keeps the paddings. Due to the extra 
spaces, string filter queries of TPC-DS fail to match. For example, q13 query 
results are all nulls and returns too fast because string filter does not meet 
any rows.

Therefore, TPCDSQueryBenchmark is benchmarking with wrong query results and 
that is inflating some performance results.

I am exploring two possible solutions now
1. Call `{{{}CREATE TABLE tableName schema USING parquet LOCATION path` 
{}}}before reading. This is what Spark unit tests are doing
2. Change varchar to string in the schema. This is what [databricks data 
generator|[https://github.com/databricks/spark-sql-perf]] is doing

TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in 
https://issues.apache.org/jira/browse/SPARK-35192

History related varchar issue 
[https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn]


> Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
> 
>
> Key: SPARK-39584
> URL: https://issues.apache.org/jira/browse/SPARK-39584
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.1, 3.3.0, 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> GenTPCDSData uses the schema defined in `TPCDSSchema` that contains 
> varchar(N)/char(N). When GenTPCDSData generates parquet, that pads spaces for 
> strings whose lengths are < N.
> When TPCDSQueryBenchmark reads data from parquet generated by GenTPCDSData, 
> it uses schema from the parquet file and keeps the paddings. Due to the extra 
> spaces, string filter queries of TPC-DS fail to match. For example, q13 query 
> results are all nulls and returns too fast because string filter does not 
> meet any rows.
> Therefore, TPCDSQueryBenchmark is benchmarking with wrong query results and 
> that is inflating some performance results.
> I am exploring two possible solutions now
> 1. Call `CREATE TABLE tableName schema USING parquet LOCATION path` before 
> reading. This is what Spark unit tests are doing
> 2. Change varchar to string in the schema. This is what [databricks data 
> generator|https://github.com/databricks/spark-sql-perf] is doing
> TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in 
> https://issues.apache.org/jira/browse/SPARK-35192
> History related varchar issue 
> [https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39584) Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results

2022-06-24 Thread Kazuyuki Tanimura (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuyuki Tanimura updated SPARK-39584:
--
Description: 
GenTPCDSData uses the schema defined in `TPCDSSchema` that contains 
varchar(N)/char(N). When GenTPCDSData generates parquet, that pads spaces for 
strings whose lengths are < N.

When TPCDSQueryBenchmark reads data from parquet generated by GenTPCDSData, it 
uses schema from the parquet file and keeps the paddings. Due to the extra 
spaces, string filter queries of TPC-DS fail to match. For example, q13 query 
results are all nulls and returns too fast because string filter does not meet 
any rows.

Therefore, TPCDSQueryBenchmark is benchmarking with wrong query results and 
that is inflating some performance results.

I am exploring two possible solutions now
1. Call `{{{}CREATE TABLE tableName schema USING parquet LOCATION path` 
{}}}before reading. This is what Spark unit tests are doing
2. Change varchar to string in the schema. This is what [databricks data 
generator|[https://github.com/databricks/spark-sql-perf]] is doing

TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in 
https://issues.apache.org/jira/browse/SPARK-35192

History related varchar issue 
[https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn]

  was:
GenTPCDSData uses the schema defined in `TPCDSSchema` that contains 
varchar(N)/char(N). When GenTPCDSData generates parquet, that pads spaces for 
strings whose lengths are < N.

When TPCDSQueryBenchmark reads data from parquet generated by GenTPCDSData, it 
uses schema from the parquet file and keeps the paddings. Due to the extra 
spaces, string filter queries of TPC-DS fail to match. For example, q13 query 
results are all nulls and returns too fast because string filter does not meet 
any rows.

Therefore, TPCDSQueryBenchmark is benchmarking with wrong query results and 
that is inflating some performance results.


I am exploring two possible solutions now
1. Call `{{{}CREATE TABLE tableName schema USING parquet LOCATION path` 
{}}}before reading. This is what Spark unit tests are doing
2. Change varchar to string in the schema. This is what [databricks data 
generator| [https://github.com/databricks/spark-sql-perf]] is doing

TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in 
https://issues.apache.org/jira/browse/SPARK-35192

History related varchar 
https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn


> Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
> 
>
> Key: SPARK-39584
> URL: https://issues.apache.org/jira/browse/SPARK-39584
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.1, 3.3.0, 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> GenTPCDSData uses the schema defined in `TPCDSSchema` that contains 
> varchar(N)/char(N). When GenTPCDSData generates parquet, that pads spaces for 
> strings whose lengths are < N.
> When TPCDSQueryBenchmark reads data from parquet generated by GenTPCDSData, 
> it uses schema from the parquet file and keeps the paddings. Due to the extra 
> spaces, string filter queries of TPC-DS fail to match. For example, q13 query 
> results are all nulls and returns too fast because string filter does not 
> meet any rows.
> Therefore, TPCDSQueryBenchmark is benchmarking with wrong query results and 
> that is inflating some performance results.
> I am exploring two possible solutions now
> 1. Call `{{{}CREATE TABLE tableName schema USING parquet LOCATION path` 
> {}}}before reading. This is what Spark unit tests are doing
> 2. Change varchar to string in the schema. This is what [databricks data 
> generator|[https://github.com/databricks/spark-sql-perf]] is doing
> TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in 
> https://issues.apache.org/jira/browse/SPARK-35192
> History related varchar issue 
> [https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39584) Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results

2022-06-24 Thread Kazuyuki Tanimura (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558678#comment-17558678
 ] 

Kazuyuki Tanimura commented on SPARK-39584:
---

Hi [~maropu], pinging you since it seems you are the expert in this area. I am 
wondering if you have a preference between solutions 1 and 2?

CC [~dongjoon]

I will be adding benchmark results from master for future comparison purposes.

> Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
> 
>
> Key: SPARK-39584
> URL: https://issues.apache.org/jira/browse/SPARK-39584
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.3, 3.1.2, 3.2.1, 3.3.0, 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> GenTPCDSData uses the schema defined in `TPCDSSchema` that contains 
> varchar(N)/char(N). When GenTPCDSData generates parquet, that pads spaces for 
> strings whose lengths are < N.
> When TPCDSQueryBenchmark reads data from parquet generated by GenTPCDSData, 
> it uses schema from the parquet file and keeps the paddings. Due to the extra 
> spaces, string filter queries of TPC-DS fail to match. For example, q13 query 
> results are all nulls and returns too fast because string filter does not 
> meet any rows.
> Therefore, TPCDSQueryBenchmark is benchmarking with wrong query results and 
> that is inflating some performance results.
> I am exploring two possible solutions now
> 1. Call `{{{}CREATE TABLE tableName schema USING parquet LOCATION path` 
> {}}}before reading. This is what Spark unit tests are doing
> 2. Change varchar to string in the schema. This is what [databricks data 
> generator| [https://github.com/databricks/spark-sql-perf]] is doing
> TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in 
> https://issues.apache.org/jira/browse/SPARK-35192
> History related varchar 
> https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39584) Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results

2022-06-24 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-39584:
-

 Summary: Fix TPCDSQueryBenchmark Measuring Performance of Wrong 
Query Results
 Key: SPARK-39584
 URL: https://issues.apache.org/jira/browse/SPARK-39584
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 3.3.0, 3.2.1, 3.1.2, 3.0.3, 3.4.0
Reporter: Kazuyuki Tanimura


GenTPCDSData uses the schema defined in `TPCDSSchema` that contains 
varchar(N)/char(N). When GenTPCDSData generates parquet, that pads spaces for 
strings whose lengths are < N.

When TPCDSQueryBenchmark reads data from parquet generated by GenTPCDSData, it 
uses schema from the parquet file and keeps the paddings. Due to the extra 
spaces, string filter queries of TPC-DS fail to match. For example, q13 query 
results are all nulls and returns too fast because string filter does not meet 
any rows.

Therefore, TPCDSQueryBenchmark is benchmarking with wrong query results and 
that is inflating some performance results.


I am exploring two possible solutions now
1. Call `CREATE TABLE tableName schema USING parquet LOCATION path` before 
reading. This is what Spark unit tests are doing
2. Change varchar to string in the schema. This is what [databricks data 
generator|https://github.com/databricks/spark-sql-perf] is doing

TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in 
https://issues.apache.org/jira/browse/SPARK-35192

History related varchar 
https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn
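
A hedged sketch of option 1 (the table, columns, and path below are placeholders, not the full TPC-DS schema): declare the char/varchar schema over the generated parquet location instead of relying on the schema embedded in the parquet files.

{code:scala}
// Sketch only: create the table with its declared char types so string filters
// follow char semantics rather than matching against the padded raw values.
spark.sql(
  """CREATE TABLE customer_demo (cd_gender CHAR(1), cd_education_status CHAR(20))
    |USING parquet LOCATION '/path/to/tpcds/customer_demographics'""".stripMargin)
spark.table("customer_demo").where("cd_gender = 'M'").show()
{code}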



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38573) Support Auto Partition Statistics Collection

2022-04-04 Thread Kazuyuki Tanimura (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuyuki Tanimura updated SPARK-38573:
--
Summary: Support Auto Partition Statistics Collection  (was: Support Auto 
Partition Level Statistics Collection)

> Support Auto Partition Statistics Collection
> 
>
> Key: SPARK-38573
> URL: https://issues.apache.org/jira/browse/SPARK-38573
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Major
>
> Currently https://issues.apache.org/jira/browse/SPARK-21127 supports storing 
> the aggregated stats at table level for partitioned tables with config 
> spark.sql.statistics.size.autoUpdate.enabled.
> Supporting partition-level stats is useful for knowing which partitions are 
> outliers (skewed partitions), and the query optimizer works better with 
> partition-level stats in the case of partition pruning.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38786) Test Bug in StatisticsSuite "change stats after add/drop partition command"

2022-04-04 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-38786:
-

 Summary: Test Bug in StatisticsSuite "change stats after add/drop 
partition command"
 Key: SPARK-38786
 URL: https://issues.apache.org/jira/browse/SPARK-38786
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.4.0
Reporter: Kazuyuki Tanimura


[https://github.com/apache/spark/blob/cbffc12f90e45d33e651e38cf886d7ab4bcf96da/sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala#L979]

It should be `partDir2` instead of `partDir1`. It looks like a copy-paste bug.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38573) Support Auto Partition Level Statistics Collection

2022-04-04 Thread Kazuyuki Tanimura (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuyuki Tanimura updated SPARK-38573:
--
Summary: Support Auto Partition Level Statistics Collection  (was: Support 
Partition Level Statistics Collection)

> Support Auto Partition Level Statistics Collection
> --
>
> Key: SPARK-38573
> URL: https://issues.apache.org/jira/browse/SPARK-38573
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Major
>
> Currently https://issues.apache.org/jira/browse/SPARK-21127 supports storing 
> the aggregated stats at table level for partitioned tables with config 
> spark.sql.statistics.size.autoUpdate.enabled.
> Supporting partition-level stats is useful for knowing which partitions are 
> outliers (skewed partitions), and the query optimizer works better with 
> partition-level stats in the case of partition pruning.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38573) Support Partition Level Statistics Collection

2022-04-04 Thread Kazuyuki Tanimura (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuyuki Tanimura updated SPARK-38573:
--
Affects Version/s: 3.4.0
   (was: 3.3.0)

> Support Partition Level Statistics Collection
> -
>
> Key: SPARK-38573
> URL: https://issues.apache.org/jira/browse/SPARK-38573
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kazuyuki Tanimura
>Priority: Major
>
> Currently https://issues.apache.org/jira/browse/SPARK-21127 supports storing 
> the aggregated stats at table level for partitioned tables with config 
> spark.sql.statistics.size.autoUpdate.enabled.
> Supporting partition-level stats is useful for knowing which partitions are 
> outliers (skewed partitions), and the query optimizer works better with 
> partition-level stats in the case of partition pruning.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38573) Support Partition Level Statistics Collection

2022-03-16 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-38573:
-

 Summary: Support Partition Level Statistics Collection
 Key: SPARK-38573
 URL: https://issues.apache.org/jira/browse/SPARK-38573
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Kazuyuki Tanimura


Currently https://issues.apache.org/jira/browse/SPARK-21127 supports storing 
the aggregated stats at table level for partitioned tables with config 
spark.sql.statistics.size.autoUpdate.enabled.

Supporting partition-level stats is useful for knowing which partitions are 
outliers (skewed partitions), and the query optimizer works better with 
partition-level stats in the case of partition pruning.
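
For illustration (table and partition names are hypothetical), this is roughly what has to be done manually today, and what the proposal would keep up to date automatically:

{code:scala}
// Today: per-partition stats require an explicit ANALYZE per partition.
spark.sql("ANALYZE TABLE sales PARTITION (dt = '2022-03-01') COMPUTE STATISTICS")

// Table-level sizes can already be auto-updated; the proposal extends this idea
// to partition-level stats.
spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", "true")
{code}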

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38142) Move ArrowColumnVectorSuite to org.apache.spark.sql.vectorized

2022-02-08 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-38142:
-

 Summary: Move ArrowColumnVectorSuite to 
org.apache.spark.sql.vectorized
 Key: SPARK-38142
 URL: https://issues.apache.org/jira/browse/SPARK-38142
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Kazuyuki Tanimura


Currently ArrowColumnVector is under org.apache.spark.sql.vectorized. However, 
ArrowColumnVectorSuite is under org.apache.spark.sql.execution.vectorized.

 

Proposing to move ArrowColumnVectorSuite to org.apache.spark.sql.vectorized so 
that the package names match



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38142) Move ArrowColumnVectorSuite to org.apache.spark.sql.vectorized

2022-02-08 Thread Kazuyuki Tanimura (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489101#comment-17489101
 ] 

Kazuyuki Tanimura commented on SPARK-38142:
---

on it

> Move ArrowColumnVectorSuite to org.apache.spark.sql.vectorized
> --
>
> Key: SPARK-38142
> URL: https://issues.apache.org/jira/browse/SPARK-38142
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> Currently ArrowColumnVector is under org.apache.spark.sql.vectorized. 
> However, ArrowColumnVectorSuite is under 
> org.apache.spark.sql.execution.vectorized.
>  
> Proposing to move ArrowColumnVectorSuite to org.apache.spark.sql.vectorized 
> so that the package names match



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36665) Add more Not operator optimizations

2022-02-07 Thread Kazuyuki Tanimura (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17488506#comment-17488506
 ] 

Kazuyuki Tanimura commented on SPARK-36665:
---

[~aokolnychyi] issue resolved.

> Add more Not operator optimizations
> ---
>
> Key: SPARK-36665
> URL: https://issues.apache.org/jira/browse/SPARK-36665
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: Pasted Graphic 3.png
>
>
> {{BooleanSimplification should be able to do more simplifications for Not 
> operators applying following rules}}
>  # {{Not(null) == null}}
>  ## {{e.g. IsNull(Not(...)) can be IsNull(...)}}
>  # {{(Not(a) = b) == (a = Not(b))}}
>  ## {{e.g. Not(...) = true can be (...) = false}}
>  # {{(a != b) == (a = Not(b))}}
>  ## {{e.g. (...) != true can be (...) = false}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38132) Remove NotPropagation

2022-02-07 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-38132:
-

 Summary: Remove NotPropagation
 Key: SPARK-38132
 URL: https://issues.apache.org/jira/browse/SPARK-38132
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Kazuyuki Tanimura


To mitigate the bug introduced by SPARK-36665, remove the {{NotPropagation}} 
optimization for now until we find a better approach.

The {{NotPropagation}} optimization broke {{RewritePredicateSubquery}} so that 
it no longer properly rewrites the predicate to a NULL-aware left anti join.
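
A hypothetical illustration of the affected pattern (table names are placeholders, not from this issue):

{code:scala}
// A NOT IN predicate over a subquery is normally rewritten by
// RewritePredicateSubquery into a NULL-aware left anti join; the broken
// optimization prevented that rewrite.
spark.sql("SELECT * FROM t WHERE t.id NOT IN (SELECT s.id FROM s)").explain(true)
{code}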



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36665) Add more Not operator optimizations

2022-02-04 Thread Kazuyuki Tanimura (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487222#comment-17487222
 ] 

Kazuyuki Tanimura commented on SPARK-36665:
---

Understood, thank you [~aokolnychyi] 

I am preparing a fix. I am sorry for the inconvenience.

> Add more Not operator optimizations
> ---
>
> Key: SPARK-36665
> URL: https://issues.apache.org/jira/browse/SPARK-36665
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: Pasted Graphic 3.png
>
>
> {{BooleanSimplification should be able to do more simplifications for Not 
> operators applying following rules}}
>  # {{Not(null) == null}}
>  ## {{e.g. IsNull(Not(...)) can be IsNull(...)}}
>  # {{(Not(a) = b) == (a = Not(b))}}
>  ## {{e.g. Not(...) = true can be (...) = false}}
>  # {{(a != b) == (a = Not(b))}}
>  ## {{e.g. (...) != true can be (...) = false}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36665) Add more Not operator optimizations

2022-02-03 Thread Kazuyuki Tanimura (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486848#comment-17486848
 ] 

Kazuyuki Tanimura commented on SPARK-36665:
---

I saw the test case at [https://github.com/apache/spark/pull/35395]

 

[~aokolnychyi] [~viirya] I will add a change to filter out the Not(InSubquery). 
I am also curious why RewritePredicateSubquery did not rewrite InSubquery to 
semi Join after the not optimization. I will look into it tomorrow.

> Add more Not operator optimizations
> ---
>
> Key: SPARK-36665
> URL: https://issues.apache.org/jira/browse/SPARK-36665
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: Pasted Graphic 3.png
>
>
> {{BooleanSimplification should be able to do more simplifications for Not 
> operators applying following rules}}
>  # {{Not(null) == null}}
>  ## {{e.g. IsNull(Not(...)) can be IsNull(...)}}
>  # {{(Not(a) = b) == (a = Not(b))}}
>  ## {{e.g. Not(...) = true can be (...) = false}}
>  # {{(a != b) == (a = Not(b))}}
>  ## {{e.g. (...) != true can be (...) = false}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36665) Add more Not operator optimizations

2022-02-03 Thread Kazuyuki Tanimura (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486829#comment-17486829
 ] 

Kazuyuki Tanimura commented on SPARK-36665:
---

[~aokolnychyi] Thank you for bringing this up. I would like to make sure I 
understood the problem correctly; please bear with me.

The example you posted is 
{quote}Not(Not(InSubquery(...)) <=> true)
{quote}
After the optimization, it becomes 
{quote}Not(InSubquery(...) <=> false)
{quote}
When you mention that "{{{}RewritePredicateSubquery{}}} does not rewrite...This 
leads to a wrong query result", did you mean that the query plan is merely 
suboptimal, or that the output values are actually wrong?

 

 

> Add more Not operator optimizations
> ---
>
> Key: SPARK-36665
> URL: https://issues.apache.org/jira/browse/SPARK-36665
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Assignee: Kazuyuki Tanimura
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: Pasted Graphic 3.png
>
>
> {{BooleanSimplification should be able to do more simplifications for Not 
> operators by applying the following rules}}
>  # {{Not(null) == null}}
>  ## {{e.g. IsNull(Not(...)) can be IsNull(...)}}
>  # {{(Not(a) = b) == (a = Not(b))}}
>  ## {{e.g. Not(...) = true can be (...) = false}}
>  # {{(a != b) == (a = Not(b))}}
>  ## {{e.g. (...) != true can be (...) = false}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38086) Make ArrowColumnVector Extendable

2022-02-01 Thread Kazuyuki Tanimura (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485530#comment-17485530
 ] 

Kazuyuki Tanimura commented on SPARK-38086:
---

I am working on this

> Make ArrowColumnVector Extendable
> -
>
> Key: SPARK-38086
> URL: https://issues.apache.org/jira/browse/SPARK-38086
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Kazuyuki Tanimura
>Priority: Minor
>
> Some Spark extension libraries need to extend ArrowColumnVector.java. For 
> now, that is impossible because the ArrowColumnVector class is final and its 
> accessors are all private.
> For example, Rapids copies the entire ArrowColumnVector class in order to 
> work around the issue
> [https://github.com/NVIDIA/spark-rapids/blob/main/sql-plugin/src/main/java/org/apache/spark/sql/vectorized/rapids/AccessibleArrowColumnVector.java]
> Proposing to relax private/final restrictions to make ArrowColumnVector 
> extendable.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38086) Make ArrowColumnVector Extendable

2022-02-01 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-38086:
-

 Summary: Make ArrowColumnVector Extendable
 Key: SPARK-38086
 URL: https://issues.apache.org/jira/browse/SPARK-38086
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Kazuyuki Tanimura


Some Spark extension libraries need to extend ArrowColumnVector.java. For now, 
that is impossible because the ArrowColumnVector class is final and its 
accessors are all private.

For example, Rapids copies the entire ArrowColumnVector class in order to work 
around the issue

[https://github.com/NVIDIA/spark-rapids/blob/main/sql-plugin/src/main/java/org/apache/spark/sql/vectorized/rapids/AccessibleArrowColumnVector.java]

Proposing to relax private/final restrictions to make ArrowColumnVector 
extendable.
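
A sketch of what extension libraries could then do (hypothetical subclass and 
method names, assuming the final modifier is removed as proposed):

{code:scala}
import org.apache.arrow.vector.ValueVector
import org.apache.spark.sql.vectorized.ArrowColumnVector

// Hypothetical subclass, assuming ArrowColumnVector is no longer final as
// proposed. It reuses the existing public constructor and overrides a single
// accessor to add simple read instrumentation.
class AuditingArrowColumnVector(vector: ValueVector) extends ArrowColumnVector(vector) {
  private var reads: Long = 0L

  override def getInt(rowId: Int): Int = {
    reads += 1            // count how many int values were read through this vector
    super.getInt(rowId)
  }

  def readCount: Long = reads
}
{code}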



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35867) Enable vectorized read for VectorizedPlainValuesReader.readBooleans

2021-11-08 Thread Kazuyuki Tanimura (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17440813#comment-17440813
 ] 

Kazuyuki Tanimura commented on SPARK-35867:
---

I am working on this

> Enable vectorized read for VectorizedPlainValuesReader.readBooleans
> ---
>
> Key: SPARK-35867
> URL: https://issues.apache.org/jira/browse/SPARK-35867
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Minor
>
> Currently we decode PLAIN-encoded booleans as follows:
> {code:java}
>   public final void readBooleans(int total, WritableColumnVector c, int rowId) {
>     // TODO: properly vectorize this
>     for (int i = 0; i < total; i++) {
>       c.putBoolean(rowId + i, readBoolean());
>     }
>   }
> {code}
> Ideally we should vectorize this.
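
A standalone sketch of the bit-unpacking a vectorized path could do (plain 
Scala, illustrative only and not the actual patch; Parquet PLAIN booleans are 
bit-packed, least significant bit first):

{code:scala}
// Illustrative sketch: unpack bit-packed booleans a whole byte (8 values) at
// a time instead of issuing one readBoolean() call per value.
def readBooleansUnpacked(total: Int, packed: Array[Byte], out: Array[Boolean], rowId: Int): Unit = {
  var i = 0
  while (i + 8 <= total) {          // full bytes: 8 booleans each
    val b = packed(i / 8) & 0xFF
    var bit = 0
    while (bit < 8) {
      out(rowId + i + bit) = ((b >> bit) & 1) == 1
      bit += 1
    }
    i += 8
  }
  while (i < total) {               // tail: remaining bits of the last byte
    out(rowId + i) = (((packed(i / 8) & 0xFF) >> (i % 8)) & 1) == 1
    i += 1
  }
}
{code}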



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36721) Simplify boolean equalities if one side is literal

2021-09-10 Thread Kazuyuki Tanimura (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413404#comment-17413404
 ] 

Kazuyuki Tanimura commented on SPARK-36721:
---

I am working on this

> Simplify boolean equalities if one side is literal
> --
>
> Key: SPARK-36721
> URL: https://issues.apache.org/jira/browse/SPARK-36721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.3.0
>Reporter: Kazuyuki Tanimura
>Priority: Major
>
> The following query does not push down the filter 
> ```
> SELECT * FROM t WHERE (a AND b) = true
> ```
> although the following equivalent query pushes down the filter as expected.
> ```
> SELECT * FROM t WHERE (a AND b) 
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36721) Simplify boolean equalities if one side is literal

2021-09-10 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-36721:
-

 Summary: Simplify boolean equalities if one side is literal
 Key: SPARK-36721
 URL: https://issues.apache.org/jira/browse/SPARK-36721
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.2, 3.2.0, 3.3.0
Reporter: Kazuyuki Tanimura


The following query does not push down the filter 

```

SELECT * FROM t WHERE (a AND b) = true

```

although the following equivalent query pushes down the filter as expected.

```

SELECT * FROM t WHERE (a AND b) 

```
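
A minimal repro sketch of the two queries above (the parquet path and sample 
rows are made up; the point is to compare the PushedFilters entry in the two 
scan plans):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: write a tiny parquet table to a hypothetical path and compare
// the physical plans of the two equivalent predicates from the description.
val spark = SparkSession.builder().master("local[*]").appName("bool-eq-sketch").getOrCreate()
import spark.implicits._

Seq((true, true, 1), (true, false, 2), (false, true, 3))
  .toDF("a", "b", "x")
  .write.mode("overwrite").parquet("/tmp/spark36721")

spark.read.parquet("/tmp/spark36721").createOrReplaceTempView("t")
spark.sql("SELECT * FROM t WHERE (a AND b) = true").explain()  // reported: filter not pushed down
spark.sql("SELECT * FROM t WHERE (a AND b)").explain()         // reported: filter pushed down
{code}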



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36665) Add more Not operator optimizations

2021-09-03 Thread Kazuyuki Tanimura (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409746#comment-17409746
 ] 

Kazuyuki Tanimura commented on SPARK-36665:
---

I am working on this

> Add more Not operator optimizations
> ---
>
> Key: SPARK-36665
> URL: https://issues.apache.org/jira/browse/SPARK-36665
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.3.0
>Reporter: Kazuyuki Tanimura
>Priority: Major
>
> {{BooleanSimplification should be able to do more simplifications for Not 
> operators by applying the following rules}}
>  # {{Not(null) == null}}
>  ## {{e.g. IsNull(Not(...)) can be IsNull(...)}}
>  # {{(Not(a) = b) == (a = Not(b))}}
>  ## {{e.g. Not(...) = true can be (...) = false}}
>  # {{(a != b) == (a = Not(b))}}
>  ## {{e.g. (...) != true can be (...) = false}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36665) Add more Not operator optimizations

2021-09-03 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-36665:
-

 Summary: Add more Not operator optimizations
 Key: SPARK-36665
 URL: https://issues.apache.org/jira/browse/SPARK-36665
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.2, 3.2.0, 3.3.0
Reporter: Kazuyuki Tanimura


{{BooleanSimplification should be able to do more simplifications for Not 
operators by applying the following rules}} (see the sketch after this list)
 # {{Not(null) == null}}
 ## {{e.g. IsNull(Not(...)) can be IsNull(...)}}
 # {{(Not(a) = b) == (a = Not(b))}}
 ## {{e.g. Not(...) = true can be (...) = false}}
 # {{(a != b) == (a = Not(b))}}
 ## {{e.g. (...) != true can be (...) = false}}
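
As a sanity check that the three rewrites above preserve SQL three-valued 
logic, here is a small self-contained sketch (plain Scala modelling NULL as 
None; this is not Spark code):

{code:scala}
// Model SQL three-valued logic with Option[Boolean]; None stands for NULL.
def not3(a: Option[Boolean]): Option[Boolean] = a.map(!_)
def eq3(a: Option[Boolean], b: Option[Boolean]): Option[Boolean] =
  for (x <- a; y <- b) yield x == y
def neq3(a: Option[Boolean], b: Option[Boolean]): Option[Boolean] =
  for (x <- a; y <- b) yield x != y

val domain = Seq(Some(true), Some(false), None)
for (a <- domain; b <- domain) {
  assert(not3(a).isEmpty == a.isEmpty)         // rule 1: Not(null) == null
  assert(eq3(not3(a), b) == eq3(a, not3(b)))   // rule 2: (Not(a) = b) == (a = Not(b))
  assert(neq3(a, b) == eq3(a, not3(b)))        // rule 3: (a != b) == (a = Not(b))
}
println("all three rewrite rules hold under three-valued logic")
{code}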



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36644) Push down boolean column filter

2021-09-01 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-36644:
-

 Summary: Push down boolean column filter
 Key: SPARK-36644
 URL: https://issues.apache.org/jira/browse/SPARK-36644
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.1.2, 3.2.0
Reporter: Kazuyuki Tanimura


The following query does not push down the filter 

```

SELECT * FROM t WHERE boolean_field

```

although the following query pushes down the filter as expected.

```

SELECT * FROM t WHERE boolean_field = true

```
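
For illustration, the data source filter API can already express the 
equivalent comparison, which is presumably what a bare boolean column should 
translate to (a sketch of the intended translation, not the actual fix):

{code:scala}
import org.apache.spark.sql.sources.{EqualTo, Filter}

// A bare boolean predicate column is equivalent to comparing it with true,
// which the source filter API can represent and hand to the reader.
val pushable: Filter = EqualTo("boolean_field", true)
println(pushable)  // EqualTo(boolean_field,true)
{code}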



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36644) Push down boolean column filter

2021-09-01 Thread Kazuyuki Tanimura (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408348#comment-17408348
 ] 

Kazuyuki Tanimura commented on SPARK-36644:
---

I am working on this issue

> Push down boolean column filter
> ---
>
> Key: SPARK-36644
> URL: https://issues.apache.org/jira/browse/SPARK-36644
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: Kazuyuki Tanimura
>Priority: Major
>
> The following query does not push down the filter 
> ```
> SELECT * FROM t WHERE boolean_field
> ```
> although the following query pushes down the filter as expected.
> ```
> SELECT * FROM t WHERE boolean_field = true
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36607) Support BooleanType in UnwrapCastInBinaryComparison

2021-08-28 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-36607:
-

 Summary: Support BooleanType in UnwrapCastInBinaryComparison
 Key: SPARK-36607
 URL: https://issues.apache.org/jira/browse/SPARK-36607
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.2, 3.2.0, 3.3.0
Reporter: Kazuyuki Tanimura


Enhancing the previous work from SPARK-24994 and SPARK-32858.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32210) Failed to serialize large MapStatuses

2021-08-10 Thread Kazuyuki Tanimura (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuyuki Tanimura updated SPARK-32210:
--
Affects Version/s: 3.3.0
   2.4.8
   3.0.3

> Failed to serialize large MapStatuses
> -
>
> Key: SPARK-32210
> URL: https://issues.apache.org/jira/browse/SPARK-32210
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.4, 2.4.8, 3.0.3, 3.1.2, 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> Driver side exception:
> {noformat}
> 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-3] 
> spark.MapOutputTrackerMaster:91 :
> java.lang.NegativeArraySizeException
> at 
> org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984)
> at 
> org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228)
> at 
> org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222)
> at 
> org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222)
> at 
> org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72)
> at 
> org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-5] 
> spark.MapOutputTrackerMaster:91 :
> java.lang.NegativeArraySizeException
> at 
> org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322)
> at 
> org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984)
> at 
> org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228)
> at 
> org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222)
> at 
> org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222)
> at 
> org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72)
> at 
> org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222)
> at 
> org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-2] 
> spark.MapOutputTrackerMaster:91 :
> {noformat}
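
The exception above is consistent with an Int-sized length wrapping negative 
once the serialized statuses exceed 2 GiB; a tiny sketch of the arithmetic 
(illustrative only, not Spark or commons-io code):

{code:scala}
// A payload length only fits in an Int up to ~2.1 GB; past that the value
// wraps negative, and allocating an array with it fails.
val payloadBytes = 2L * 1024 * 1024 * 1024 + 1   // just over 2 GiB
val asInt = payloadBytes.toInt                   // wraps to a negative Int
println(asInt)                                   // -2147483647
// new Array[Byte](asInt)                        // would throw NegativeArraySizeException
{code}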



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data

2021-08-09 Thread Kazuyuki Tanimura (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuyuki Tanimura updated SPARK-36464:
--
Description: 
The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; 
however, the underlying `_size` variable is initialized as `Int`.

That causes an overflow and returns a negative size when over 2GB data is 
written into `ChunkedByteBufferOutputStream`

  was:
The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; 
however, the underlying `_size` variable is initialized as `Int`.

That causes an overflow and returns a negative size when over 2GB data is 
written into `ChunkedByteBufferOutputStream`

 

build/sbt "core/testOnly *ChunkedByteBufferOutputStreamSuite -- -z SPARK-36464"


> Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream 
> for Writing Over 2GB Data
> --
>
> Key: SPARK-36464
> URL: https://issues.apache.org/jira/browse/SPARK-36464
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Kazuyuki Tanimura
>Priority: Major
>
> The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; 
> however, the underlying `_size` variable is initialized as `Int`.
> That causes an overflow and returns a negative size when over 2GB data is 
> written into `ChunkedByteBufferOutputStream`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data

2021-08-09 Thread Kazuyuki Tanimura (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuyuki Tanimura updated SPARK-36464:
--
Description: 
The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; 
however, the underlying `_size` variable is initialized as `Int`.

That causes an overflow and returns a negative size when over 2GB data is 
written into `ChunkedByteBufferOutputStream`

 

build/sbt "core/testOnly *ChunkedByteBufferOutputStreamSuite -- -z SPARK-36464"

  was:
The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; 
however, the underlying `_size` variable is initialized as `Int`.

That causes an overflow and returns a negative size when over 2GB data is 
written into `ChunkedByteBufferOutputStream`


> Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream 
> for Writing Over 2GB Data
> --
>
> Key: SPARK-36464
> URL: https://issues.apache.org/jira/browse/SPARK-36464
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Kazuyuki Tanimura
>Priority: Major
>
> The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; 
> however, the underlying `_size` variable is initialized as `Int`.
> That causes an overflow and returns a negative size when over 2GB data is 
> written into `ChunkedByteBufferOutputStream`
>  
> build/sbt "core/testOnly *ChunkedByteBufferOutputStreamSuite -- -z 
> SPARK-36464"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data

2021-08-09 Thread Kazuyuki Tanimura (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuyuki Tanimura updated SPARK-36464:
--
Description: 
The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; 
however, the underlying `_size` variable is initialized as `Int`.

That causes an overflow and returns a negative size when over 2GB data is 
written into `ChunkedByteBufferOutputStream`

  was:
The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; 
however, the underlying `_size` variable is initialized as `int`.

That causes an overflow and returns negative size when over 2GB data is written 
into `ChunkedByteBufferOutputStream`


> Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream 
> for Writing Over 2GB Data
> --
>
> Key: SPARK-36464
> URL: https://issues.apache.org/jira/browse/SPARK-36464
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: Kazuyuki Tanimura
>Priority: Major
>
> The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; 
> however, the underlying `_size` variable is initialized as `Int`.
> That causes an overflow and returns a negative size when over 2GB data is 
> written into `ChunkedByteBufferOutputStream`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data

2021-08-09 Thread Kazuyuki Tanimura (Jira)
Kazuyuki Tanimura created SPARK-36464:
-

 Summary: Fix Underlying Size Variable Initialization in 
ChunkedByteBufferOutputStream for Writing Over 2GB Data
 Key: SPARK-36464
 URL: https://issues.apache.org/jira/browse/SPARK-36464
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.2
Reporter: Kazuyuki Tanimura


The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; 
however, the underlying `_size` variable is initialized as `int`.

That causes an overflow and returns negative size when over 2GB data is written 
into `ChunkedByteBufferOutputStream`
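
A minimal sketch of the failure mode described above (a toy class, not the 
actual Spark implementation):

{code:scala}
// Toy reproduction of the bug class: an Int byte counter silently wraps past
// 2 GiB, so a Long-returning size reports a negative value.
class CountingStream {
  private var _size: Int = 0                 // buggy: should be a Long
  def write(numBytes: Int): Unit = { _size += numBytes }
  def size: Long = _size
}

val s = new CountingStream
val chunk = 64 * 1024 * 1024                 // 64 MiB per write
(1 to 33).foreach(_ => s.write(chunk))       // 33 * 64 MiB > 2 GiB in total
println(s.size)                              // negative because the Int wrapped
{code}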



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org