[jira] [Commented] (SPARK-46295) TPCDS q39a and q39b have correctness issues with broadcast hash join and shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-46295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17794081#comment-17794081 ]

Kazuyuki Tanimura commented on SPARK-46295:
-------------------------------------------

I realized that I am using the default {{numPartitions=100}} while the CI (GHA) is using {{numPartitions=1}}, so it looks like this issue happens when {{numPartitions != 1}}.
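The one-digit differences in the diffs below look like floating-point rounding rather than wrong rows; a plausible (unconfirmed) mechanism is that double addition is not associative, so the partition count changes the order in which partial aggregates are merged. A minimal, self-contained sketch:

{code:scala}
// Minimal sketch (assumption: illustrates the suspected mechanism, not a
// confirmed root cause). IEEE 754 double addition is not associative, so
// the order in which partial sums are combined can flip trailing digits.
val leftToRight = (0.1 + 0.2) + 0.3   // 0.6000000000000001
val rightToLeft = 0.1 + (0.2 + 0.3)   // 0.6

// With numPartitions=1 all values are combined in one order; with
// numPartitions=100 each partition contributes a partial aggregate and the
// merge order differs, which can change the last digit of stdev/mean-style
// results exactly as in the q39a/q39b diffs.
println(leftToRight == rightToLeft)   // false
{code}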
[jira] [Updated] (SPARK-46295) TPCDS q39a and q39b have correctness issues with broadcast hash join and shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-46295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kazuyuki Tanimura updated SPARK-46295:
--------------------------------------
    Affects Version/s: 3.4.2
                           (was: 3.4.1)
[jira] [Updated] (SPARK-46295) TPCDS q39a and q39b have correctness issues with broadcast hash join and shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-46295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kazuyuki Tanimura updated SPARK-46295:
--------------------------------------
    Labels: correctness  (was: )
[jira] [Created] (SPARK-46295) TPCDS q39a and q39b have correctness issues with broadcast hash join and shuffled hash join
Kazuyuki Tanimura created SPARK-46295:
--------------------------------------

             Summary: TPCDS q39a and q39b have correctness issues with broadcast hash join and shuffled hash join
                 Key: SPARK-46295
                 URL: https://issues.apache.org/jira/browse/SPARK-46295
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.5.0, 3.4.1, 4.0.0
            Reporter: Kazuyuki Tanimura

{{TPCDSQueryTestSuite}} fails for q39a and q39b with {{broadcastHashJoinConf}} and {{shuffledHashJoinConf}}. It works fine with {{sortMergeJoinConf}}.

{code}
SPARK_TPCDS_DATA= build/sbt "~sql/testOnly *TPCDSQueryTestSuite -- -z q39a"
{code}
{code}
[info] - q39a *** FAILED *** (19 seconds, 139 milliseconds)
[info]   java.lang.Exception: Expected "...25 1.022382911080458[8 ..." but got "...25 1.022382911080458[5 ..."
{code}
{code}
SPARK_TPCDS_DATA= build/sbt "~sql/testOnly *TPCDSQueryTestSuite -- -z q39b"
{code}
{code}
[info] - q39b *** FAILED *** (19 seconds, 351 milliseconds)
[info]   java.lang.Exception: Expected "...34 1.563403519178623[3   3 10427 2 381.25 1.0623056061004696
[info] 3 33151 271.75 1.555976998814345 3 3315 2 393.75 1.0196319345405949
[info] 3 33931 260.0 1.5009563026568116 3 3393 2 470.25 1.129275872154205
[info] 4 16211 1 257.7 1.6381074811154002] 4 16211 2 352.25 1", but got "...34 1.563403519178623[5   3 10427 2 381.25 1.0623056061004696
[info] 3 33151 271.75 1.555976998814345 3 3315 2 393.75 1.0196319345405949
[info] 3 33931 260.0 1.5009563026568118 3 3393 2 470.25 1.129275872154205
[info] 4 16211 1 257.7 1.6381074811154] 4 16211 2 352.25 1" Result did not match
{code}
[jira] [Updated] (SPARK-45786) Inaccurate Decimal multiplication and division results
[ https://issues.apache.org/jira/browse/SPARK-45786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kazuyuki Tanimura updated SPARK-45786:
--------------------------------------
    Affects Version/s: 4.0.0
[jira] [Created] (SPARK-45786) Inaccurate Decimal multiplication and division results
Kazuyuki Tanimura created SPARK-45786:
--------------------------------------

             Summary: Inaccurate Decimal multiplication and division results
                 Key: SPARK-45786
                 URL: https://issues.apache.org/jira/browse/SPARK-45786
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.5.0, 3.4.1, 3.3.3, 3.2.4
            Reporter: Kazuyuki Tanimura

Decimal multiplication and division results may be inaccurate due to rounding issues.

h2. Multiplication:
{code:scala}
scala> sql("select -14120025096157587712113961295153.858047 * -0.4652").show(truncate=false)
+-----------------------------------------------------+
|(-14120025096157587712113961295153.858047 * -0.4652) |
+-----------------------------------------------------+
|6568635674732509803675414794505.574764               |
+-----------------------------------------------------+
{code}
The correct answer is
{quote}6568635674732509803675414794505.574763
{quote}
Please note that the last digit is 3 instead of 4, as
{code:scala}
scala> new java.math.BigDecimal("-14120025096157587712113961295153.858047").multiply(new java.math.BigDecimal("-0.4652"))
val res21: java.math.BigDecimal = 6568635674732509803675414794505.5747634644
{code}
Since the fractional part .574763 is followed by 4644, it should not be rounded up.

h2. Division:
{code:scala}
scala> sql("select -0.172787979 / 533704665545018957788294905796.5").show(truncate=false)
+--------------------------------------------------+
|(-0.172787979 / 533704665545018957788294905796.5) |
+--------------------------------------------------+
|-3.237521E-31                                     |
+--------------------------------------------------+
{code}
The correct answer is
{quote}-3.237520E-31
{quote}
Please note that the last digit is 0 instead of 1, as
{code:scala}
scala> new java.math.BigDecimal("-0.172787979").divide(new java.math.BigDecimal("533704665545018957788294905796.5"), 100, java.math.RoundingMode.DOWN)
val res22: java.math.BigDecimal = -3.237520489418037889998826491401059986665344697406144511563561222578738E-31
{code}
Since the fractional part .237520 is followed by 4894..., it should not be rounded up.
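For reference, the exact product above can be rounded explicitly to the result scale with plain java.math; a small sketch (scale 6 is assumed here because it matches the result scale Spark displayed):

{code:scala}
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Exact product from the multiply() call above.
val exact = new JBigDecimal("6568635674732509803675414794505.5747634644")

// Rounding half-up at scale 6 keeps ...574763 because the next digit is 4.
val rounded = exact.setScale(6, RoundingMode.HALF_UP)
println(rounded)  // 6568635674732509803675414794505.574763
{code}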
[jira] [Created] (SPARK-42833) Refactor `applyExtensions` in `SparkSession`
Kazuyuki Tanimura created SPARK-42833:
--------------------------------------

             Summary: Refactor `applyExtensions` in `SparkSession`
                 Key: SPARK-42833
                 URL: https://issues.apache.org/jira/browse/SPARK-42833
             Project: Spark
          Issue Type: Task
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: Kazuyuki Tanimura

Refactor `applyExtensions` in `SparkSession` in order to reduce duplicated code.
[jira] [Updated] (SPARK-42256) SPIP: Lazy Materialization for Parquet Read Performance Improvement
[ https://issues.apache.org/jira/browse/SPARK-42256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kazuyuki Tanimura updated SPARK-42256:
--------------------------------------
    Description:
Spark-SQL filter operation is a common workload in order to select specific rows from persisted data. The current implementation of Spark requires the read values to materialize (i.e. de-compress, de-code, etc...) onto memory first before applying the filters. This approach means that the filters may eventually throw away many values, resulting in wasted computations. Alternatively, evaluating the filters first and lazily materializing only the used values can save waste and improve the read performance. Lazy materialization has been employed by other distributed SQL engines such as Velox and Presto/Trino, but this approach has not yet been extended to Spark with Parquet.

SPIP: https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME

  was:
Spark-SQL filter operation is a common workload in order to select specific rows from persisted data. The current implementation of Spark requires the read values to materialize (i.e. de-compress, de-code, etc...) onto memory first before applying the filters. This approach means that the filters may eventually throw away many values, resulting in wasted computations. Alternatively, evaluating the filters first and lazily materializing only the used values can save waste and improve the read performance. Lazy materialization has been employed by other distributed SQL engines such as Velox and Presto/Trino, but this approach has not yet been extended to Spark with Parquet.

SPIP: google doc
[jira] [Created] (SPARK-42256) SPIP: Lazy Materialization for Parquet Read Performance Improvement
Kazuyuki Tanimura created SPARK-42256:
--------------------------------------

             Summary: SPIP: Lazy Materialization for Parquet Read Performance Improvement
                 Key: SPARK-42256
                 URL: https://issues.apache.org/jira/browse/SPARK-42256
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 3.5.0
            Reporter: Kazuyuki Tanimura

Spark-SQL filter operation is a common workload in order to select specific rows from persisted data. The current implementation of Spark requires the read values to materialize (i.e. de-compress, de-code, etc...) onto memory first before applying the filters. This approach means that the filters may eventually throw away many values, resulting in wasted computations. Alternatively, evaluating the filters first and lazily materializing only the used values can save waste and improve the read performance. Lazy materialization has been employed by other distributed SQL engines such as Velox and Presto/Trino, but this approach has not yet been extended to Spark with Parquet.

SPIP: google doc
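A toy illustration of the idea (plain Scala, not the SPIP design and not Spark's Parquet reader): an eager scan decodes every column of every row before filtering, while a lazy scan evaluates the predicate on the filter column first and decodes the remaining columns only for surviving rows:

{code:scala}
// Toy columnar batch: a cheap filter column plus an "encoded" payload
// column that is expensive to decode.
case class Batch(filterCol: Array[Int], payloadCol: Array[Array[Byte]])

def decodePayload(bytes: Array[Byte]): String = new String(bytes) // pretend this is costly

// Eager materialization: decode everything, then filter.
def eagerScan(b: Batch, pred: Int => Boolean): Seq[(Int, String)] =
  b.filterCol.indices
    .map(i => (b.filterCol(i), decodePayload(b.payloadCol(i)))) // decodes every row
    .filter { case (k, _) => pred(k) }

// Lazy materialization: evaluate the predicate first, decode only survivors.
def lazyScan(b: Batch, pred: Int => Boolean): Seq[(Int, String)] =
  b.filterCol.indices
    .filter(i => pred(b.filterCol(i)))                           // no payload decode yet
    .map(i => (b.filterCol(i), decodePayload(b.payloadCol(i))))  // decode survivors only
{code}

For a selective filter, the lazy variant decodes only the matching fraction of the payload column, which is where the read-performance win comes from.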
[jira] [Created] (SPARK-41096) Support reading parquet FIXED_LEN_BYTE_ARRAY type
Kazuyuki Tanimura created SPARK-41096:
--------------------------------------

             Summary: Support reading parquet FIXED_LEN_BYTE_ARRAY type
                 Key: SPARK-41096
                 URL: https://issues.apache.org/jira/browse/SPARK-41096
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: Kazuyuki Tanimura

Parquet has a FIXED_LEN_BYTE_ARRAY (FLBA) data type. However, the Spark Parquet reader currently cannot handle it; it should be read as BinaryType in Spark. The Iceberg Parquet reader, for example, can handle FLBA. This improvement should reduce the gap between the Spark and Iceberg Parquet readers.
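For reference, a sketch of what such a schema looks like, built with parquet-mr's schema parser (the message and column names are made up for illustration):

{code:scala}
import org.apache.parquet.schema.MessageTypeParser

// FIXED_LEN_BYTE_ARRAY(16) is what e.g. UUIDs or fixed-size decimals use.
// Under this proposal, Spark would read such a column as BinaryType.
val schema = MessageTypeParser.parseMessageType(
  """message example {
    |  required fixed_len_byte_array(16) uuid_col;
    |}""".stripMargin)
println(schema)
{code}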
[jira] [Resolved] (SPARK-40477) Support `NullType` in `ColumnarBatchRow`
[ https://issues.apache.org/jira/browse/SPARK-40477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kazuyuki Tanimura resolved SPARK-40477.
---------------------------------------
    Resolution: Won't Fix

Gave it another thought and decided to close this one as won't fix. There is no natural code path that calls ColumnarBatchRow.get() for NullType columns; in particular, NullType cannot be stored as a partition in a columnar format like Parquet.
[jira] [Created] (SPARK-40477) Support `NullType` in `ColumnarBatchRow`
Kazuyuki Tanimura created SPARK-40477:
--------------------------------------

             Summary: Support `NullType` in `ColumnarBatchRow`
                 Key: SPARK-40477
                 URL: https://issues.apache.org/jira/browse/SPARK-40477
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: Kazuyuki Tanimura

`ColumnarBatchRow.get()` does not support `NullType` currently. Support `NullType` in `ColumnarBatchRow` so that `NullType` can be a partition column type.
[jira] [Resolved] (SPARK-40195) Add PrunedScanWithAQESuite
[ https://issues.apache.org/jira/browse/SPARK-40195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kazuyuki Tanimura resolved SPARK-40195.
---------------------------------------
    Resolution: Invalid

I just realized the suite is not for AQE, so closing.
[jira] [Created] (SPARK-40195) Add PrunedScanWithAQESuite
Kazuyuki Tanimura created SPARK-40195:
--------------------------------------

             Summary: Add PrunedScanWithAQESuite
                 Key: SPARK-40195
                 URL: https://issues.apache.org/jira/browse/SPARK-40195
             Project: Spark
          Issue Type: Test
          Components: SQL, Tests
    Affects Versions: 3.4.0
            Reporter: Kazuyuki Tanimura

Currently `PrunedScanSuite` assumes that AQE is never applied. We should also test with AQE force-applied.
[jira] [Created] (SPARK-40110) Add JDBCWithAQESuite
Kazuyuki Tanimura created SPARK-40110:
--------------------------------------

             Summary: Add JDBCWithAQESuite
                 Key: SPARK-40110
                 URL: https://issues.apache.org/jira/browse/SPARK-40110
             Project: Spark
          Issue Type: Test
          Components: SQL, Tests
    Affects Versions: 3.4.0
            Reporter: Kazuyuki Tanimura

Currently `JDBCSuite` assumes that AQE is always turned off. We should also test with AQE turned on.
[jira] [Created] (SPARK-40088) Add SparkPlanWithAQESuite
Kazuyuki Tanimura created SPARK-40088:
--------------------------------------

             Summary: Add SparkPlanWithAQESuite
                 Key: SPARK-40088
                 URL: https://issues.apache.org/jira/browse/SPARK-40088
             Project: Spark
          Issue Type: Test
          Components: SQL, Tests
    Affects Versions: 3.4.0
            Reporter: Kazuyuki Tanimura

Currently `SparkPlanSuite` assumes that AQE is always turned off. We should also test with AQE turned on.
[jira] [Created] (SPARK-40049) Add adaptive plan case in ReplaceNullWithFalseInPredicateEndToEndSuite
Kazuyuki Tanimura created SPARK-40049:
--------------------------------------

             Summary: Add adaptive plan case in ReplaceNullWithFalseInPredicateEndToEndSuite
                 Key: SPARK-40049
                 URL: https://issues.apache.org/jira/browse/SPARK-40049
             Project: Spark
          Issue Type: Test
          Components: SQL, Tests
    Affects Versions: 3.4.0
            Reporter: Kazuyuki Tanimura

Currently `ReplaceNullWithFalseInPredicateEndToEndSuite` assumes that adaptive query execution is turned off. We should add cases with `spark.sql.adaptive.forceApply=true`.
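In a Spark SQL test suite the added cases would typically look like the sketch below, assuming the suite mixes in Spark's SQLTestUtils helpers (the exact wiring is an assumption):

{code:scala}
// Sketch: re-run the suite's existing assertions with AQE force-applied.
// Assumes `withSQLConf` from org.apache.spark.sql.test.SQLTestUtils.
withSQLConf(
    "spark.sql.adaptive.enabled" -> "true",
    "spark.sql.adaptive.forceApply" -> "true") {
  // ... existing ReplaceNullWithFalseInPredicateEndToEndSuite checks ...
}
{code}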
[jira] [Updated] (SPARK-39584) Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
Kazuyuki Tanimura updated SPARK-39584:
--------------------------------------
    Change By: Kazuyuki Tanimura

GenTPCDSData uses the schema defined in `TPCDSSchema` that contains varchar(N)/char(N). When GenTPCDSData generates parquet, that pads spaces for strings whose lengths are < N.

When TPCDSQueryBenchmark reads data from parquet generated by GenTPCDSData, it uses schema from the parquet file and keeps the paddings. Due to the extra spaces, string filter queries of TPC-DS fail to match. For example, q13 query results are all nulls and returns too fast because string filter does not meet any rows.

Therefore, TPCDSQueryBenchmark is benchmarking with wrong query results and that is inflating some performance results.

I am exploring two possible solutions now
1. Call `CREATE TABLE tableName schema USING parquet LOCATION path` before reading. This is what Spark TPC-DS unit tests are doing
2. Change varchar/char to string in the schema. This is what the [databricks data generator|https://github.com/databricks/spark-sql-perf] is doing

TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in https://issues.apache.org/jira/browse/SPARK-35192

History of the related varchar/char issue: https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn
[jira] [Commented] (SPARK-39584) Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
Kazuyuki Tanimura commented on SPARK-39584:
-------------------------------------------

Benchmark results of the current code: https://github.com/apache/spark/pull/37020/files

After applying the fixes, the running time is expected to grow, as the current queries are getting null results and returning too fast.
[jira] [Updated] (SPARK-39584) Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
[ https://issues.apache.org/jira/browse/SPARK-39584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kazuyuki Tanimura updated SPARK-39584:
--------------------------------------
    Description:
GenTPCDSData uses the schema defined in `TPCDSSchema` that contains varchar(N)/char(N). When GenTPCDSData generates parquet, that pads spaces for strings whose lengths are < N.

When TPCDSQueryBenchmark reads data from parquet generated by GenTPCDSData, it uses schema from the parquet file and keeps the paddings. Due to the extra spaces, string filter queries of TPC-DS fail to match. For example, q13 query results are all nulls and returns too fast because string filter does not meet any rows.

Therefore, TPCDSQueryBenchmark is benchmarking with wrong query results and that is inflating some performance results.

I am exploring two possible solutions now
1. Call `CREATE TABLE tableName schema USING parquet LOCATION path` before reading. This is what Spark TPC-DS unit tests are doing
2. Change varchar to string in the schema. This is what [databricks data generator|https://github.com/databricks/spark-sql-perf] is doing

TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in https://issues.apache.org/jira/browse/SPARK-35192

History related varchar issue [https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn]

  was:
GenTPCDSData uses the schema defined in `TPCDSSchema` that contains varchar(N)/char(N). When GenTPCDSData generates parquet, that pads spaces for strings whose lengths are < N.

When TPCDSQueryBenchmark reads data from parquet generated by GenTPCDSData, it uses schema from the parquet file and keeps the paddings. Due to the extra spaces, string filter queries of TPC-DS fail to match. For example, q13 query results are all nulls and returns too fast because string filter does not meet any rows.

Therefore, TPCDSQueryBenchmark is benchmarking with wrong query results and that is inflating some performance results.

I am exploring two possible solutions now
1. Call `CREATE TABLE tableName schema USING parquet LOCATION path` before reading. This is what Spark unit tests are doing
2. Change varchar to string in the schema. This is what [databricks data generator|https://github.com/databricks/spark-sql-perf] is doing

TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in https://issues.apache.org/jira/browse/SPARK-35192

History related varchar issue [https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn]
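Solution 1 amounts to declaring the `TPCDSSchema` column types on top of the already-generated parquet files, so char/varchar read semantics are applied instead of trusting the padded file schema. A hedged sketch (table name, columns, and path are placeholders, not the full TPC-DS DDL):

{code:scala}
// Sketch of solution 1 (names and path are placeholders): declare the
// TPCDSSchema types over the existing parquet files instead of reading
// the file schema, so char/varchar semantics are applied consistently.
spark.sql("""
  CREATE TABLE store_sales (
    ss_item_sk BIGINT,
    ss_store_sk BIGINT,
    ss_quantity INT
    -- ... remaining TPCDSSchema columns ...
  )
  USING parquet
  LOCATION '/path/to/tpcds/store_sales'
""")
{code}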
[jira] [Updated] (SPARK-39584) Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
[ https://issues.apache.org/jira/browse/SPARK-39584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kazuyuki Tanimura updated SPARK-39584:
--------------------------------------
    Description:
GenTPCDSData uses the schema defined in `TPCDSSchema` that contains varchar(N)/char(N). When GenTPCDSData generates parquet, that pads spaces for strings whose lengths are < N.

When TPCDSQueryBenchmark reads data from parquet generated by GenTPCDSData, it uses schema from the parquet file and keeps the paddings. Due to the extra spaces, string filter queries of TPC-DS fail to match. For example, q13 query results are all nulls and returns too fast because string filter does not meet any rows.

Therefore, TPCDSQueryBenchmark is benchmarking with wrong query results and that is inflating some performance results.

I am exploring two possible solutions now
1. Call `CREATE TABLE tableName schema USING parquet LOCATION path` before reading. This is what Spark unit tests are doing
2. Change varchar to string in the schema. This is what [databricks data generator|https://github.com/databricks/spark-sql-perf] is doing

TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in https://issues.apache.org/jira/browse/SPARK-35192

History related varchar issue [https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn]

  was:
GenTPCDSData uses the schema defined in `TPCDSSchema` that contains varchar(N)/char(N). When GenTPCDSData generates parquet, that pads spaces for strings whose lengths are < N.

When TPCDSQueryBenchmark reads data from parquet generated by GenTPCDSData, it uses schema from the parquet file and keeps the paddings. Due to the extra spaces, string filter queries of TPC-DS fail to match. For example, q13 query results are all nulls and returns too fast because string filter does not meet any rows.

Therefore, TPCDSQueryBenchmark is benchmarking with wrong query results and that is inflating some performance results.

I am exploring two possible solutions now
1. Call `{{{}CREATE TABLE tableName schema USING parquet LOCATION path` {}}}before reading. This is what Spark unit tests are doing
2. Change varchar to string in the schema. This is what [databricks data generator|[https://github.com/databricks/spark-sql-perf]] is doing

TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in https://issues.apache.org/jira/browse/SPARK-35192

History related varchar issue [https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn]
[jira] [Updated] (SPARK-39584) Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
[ https://issues.apache.org/jira/browse/SPARK-39584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kazuyuki Tanimura updated SPARK-39584:
--------------------------------------
    Description:
GenTPCDSData uses the schema defined in `TPCDSSchema` that contains varchar(N)/char(N). When GenTPCDSData generates parquet, that pads spaces for strings whose lengths are < N.

When TPCDSQueryBenchmark reads data from parquet generated by GenTPCDSData, it uses schema from the parquet file and keeps the paddings. Due to the extra spaces, string filter queries of TPC-DS fail to match. For example, q13 query results are all nulls and returns too fast because string filter does not meet any rows.

Therefore, TPCDSQueryBenchmark is benchmarking with wrong query results and that is inflating some performance results.

I am exploring two possible solutions now
1. Call `{{{}CREATE TABLE tableName schema USING parquet LOCATION path` {}}}before reading. This is what Spark unit tests are doing
2. Change varchar to string in the schema. This is what [databricks data generator|[https://github.com/databricks/spark-sql-perf]] is doing

TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in https://issues.apache.org/jira/browse/SPARK-35192

History related varchar issue [https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn]

  was:
GenTPCDSData uses the schema defined in `TPCDSSchema` that contains varchar(N)/char(N). When GenTPCDSData generates parquet, that pads spaces for strings whose lengths are < N.

When TPCDSQueryBenchmark reads data from parquet generated by GenTPCDSData, it uses schema from the parquet file and keeps the paddings. Due to the extra spaces, string filter queries of TPC-DS fail to match. For example, q13 query results are all nulls and returns too fast because string filter does not meet any rows.

Therefore, TPCDSQueryBenchmark is benchmarking with wrong query results and that is inflating some performance results.

I am exploring two possible solutions now
1. Call `{{{}CREATE TABLE tableName schema USING parquet LOCATION path` {}}}before reading. This is what Spark unit tests are doing
2. Change varchar to string in the schema. This is what [databricks data generator| [https://github.com/databricks/spark-sql-perf]] is doing

TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in https://issues.apache.org/jira/browse/SPARK-35192

History related varchar https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn
[jira] [Commented] (SPARK-39584) Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
[ https://issues.apache.org/jira/browse/SPARK-39584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558678#comment-17558678 ]

Kazuyuki Tanimura commented on SPARK-39584:
-------------------------------------------

Hi [~maropu], pinging you since it seems you are the expert in this area. I am wondering if you have a preference between solutions 1 and 2?

CC [~dongjoon]. I will be adding benchmark results from master for future comparison purposes.
[jira] [Created] (SPARK-39584) Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
Kazuyuki Tanimura created SPARK-39584:
--------------------------------------

             Summary: Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results
                 Key: SPARK-39584
                 URL: https://issues.apache.org/jira/browse/SPARK-39584
             Project: Spark
          Issue Type: Test
          Components: Tests
    Affects Versions: 3.3.0, 3.2.1, 3.1.2, 3.0.3, 3.4.0
            Reporter: Kazuyuki Tanimura

GenTPCDSData uses the schema defined in `TPCDSSchema` that contains varchar(N)/char(N). When GenTPCDSData generates parquet, that pads spaces for strings whose lengths are < N.

When TPCDSQueryBenchmark reads data from parquet generated by GenTPCDSData, it uses schema from the parquet file and keeps the paddings. Due to the extra spaces, string filter queries of TPC-DS fail to match. For example, q13 query results are all nulls and returns too fast because string filter does not meet any rows.

Therefore, TPCDSQueryBenchmark is benchmarking with wrong query results and that is inflating some performance results.

I am exploring two possible solutions now
1. Call `CREATE TABLE tableName schema USING parquet LOCATION path` before reading. This is what Spark unit tests are doing
2. Change varchar to string in the schema. This is what [databricks data generator|https://github.com/databricks/spark-sql-perf] is doing

TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in https://issues.apache.org/jira/browse/SPARK-35192

History of the related varchar issue: https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn
[jira] [Updated] (SPARK-38573) Support Auto Partition Statistics Collection
[ https://issues.apache.org/jira/browse/SPARK-38573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kazuyuki Tanimura updated SPARK-38573:
--------------------------------------
    Summary: Support Auto Partition Statistics Collection  (was: Support Auto Partition Level Statistics Collection)
[jira] [Created] (SPARK-38786) Test Bug in StatisticsSuite "change stats after add/drop partition command"
Kazuyuki Tanimura created SPARK-38786:
--------------------------------------

             Summary: Test Bug in StatisticsSuite "change stats after add/drop partition command"
                 Key: SPARK-38786
                 URL: https://issues.apache.org/jira/browse/SPARK-38786
             Project: Spark
          Issue Type: Test
          Components: SQL, Tests
    Affects Versions: 3.4.0
            Reporter: Kazuyuki Tanimura

https://github.com/apache/spark/blob/cbffc12f90e45d33e651e38cf886d7ab4bcf96da/sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala#L979

It should be `partDir2` instead of `partDir1`. Looks like it is a copy-paste bug.
[jira] [Updated] (SPARK-38573) Support Auto Partition Level Statistics Collection
[ https://issues.apache.org/jira/browse/SPARK-38573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kazuyuki Tanimura updated SPARK-38573:
--------------------------------------
    Summary: Support Auto Partition Level Statistics Collection  (was: Support Partition Level Statistics Collection)
[jira] [Updated] (SPARK-38573) Support Partition Level Statistics Collection
[ https://issues.apache.org/jira/browse/SPARK-38573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kazuyuki Tanimura updated SPARK-38573:
--------------------------------------
    Affects Version/s: 3.4.0
                           (was: 3.3.0)
[jira] [Created] (SPARK-38573) Support Partition Level Statistics Collection
Kazuyuki Tanimura created SPARK-38573:
--------------------------------------

             Summary: Support Partition Level Statistics Collection
                 Key: SPARK-38573
                 URL: https://issues.apache.org/jira/browse/SPARK-38573
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Kazuyuki Tanimura

Currently https://issues.apache.org/jira/browse/SPARK-21127 supports storing the aggregated stats at the table level for partitioned tables with the config spark.sql.statistics.size.autoUpdate.enabled.

Supporting partition-level stats is useful for knowing which partitions are outliers (skewed partitions), and the query optimizer works better with partition-level stats in case of partition pruning.
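For context, partition-level statistics can already be collected manually with ANALYZE TABLE; the proposal automates this bookkeeping, analogous to the existing table-level config. A sketch (table and partition names are illustrative):

{code:scala}
// Existing manual collection of partition-level statistics
// (illustrative table/partition names); the proposal automates this.
spark.sql("ANALYZE TABLE sales PARTITION (dt = '2022-03-16') COMPUTE STATISTICS")

// Table-level aggregated stats can already be kept fresh automatically:
spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", "true")
{code}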
[jira] [Created] (SPARK-38142) Move ArrowColumnVectorSuite to org.apache.spark.sql.vectorized
Kazuyuki Tanimura created SPARK-38142:
--------------------------------------

             Summary: Move ArrowColumnVectorSuite to org.apache.spark.sql.vectorized
                 Key: SPARK-38142
                 URL: https://issues.apache.org/jira/browse/SPARK-38142
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Kazuyuki Tanimura

Currently ArrowColumnVector is under org.apache.spark.sql.vectorized. However, ArrowColumnVectorSuite is under org.apache.spark.sql.execution.vectorized.

Proposing to move ArrowColumnVectorSuite to org.apache.spark.sql.vectorized so that the package names match.
[jira] [Commented] (SPARK-38142) Move ArrowColumnVectorSuite to org.apache.spark.sql.vectorized
[ https://issues.apache.org/jira/browse/SPARK-38142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489101#comment-17489101 ]

Kazuyuki Tanimura commented on SPARK-38142:
-------------------------------------------

on it
[jira] [Commented] (SPARK-36665) Add more Not operator optimizations
[ https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17488506#comment-17488506 ]

Kazuyuki Tanimura commented on SPARK-36665:
-------------------------------------------

[~aokolnychyi] issue resolved.

> Add more Not operator optimizations
> -----------------------------------
>
>                 Key: SPARK-36665
>                 URL: https://issues.apache.org/jira/browse/SPARK-36665
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Kazuyuki Tanimura
>            Assignee: Kazuyuki Tanimura
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: Pasted Graphic 3.png
>
> {{BooleanSimplification}} should be able to do more simplifications for Not operators by applying the following rules:
> # Not(null) == null
> ## e.g. IsNull(Not(...)) can be IsNull(...)
> # (Not(a) = b) == (a = Not(b))
> ## e.g. Not(...) = true can be (...) = false
> # (a != b) == (a = Not(b))
> ## e.g. (...) != true can be (...) = false
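The three rules in the quoted description can be sanity-checked under SQL's three-valued logic by modeling nullable booleans as Option[Boolean]; a small self-contained sketch (plain Scala, not Catalyst code):

{code:scala}
// Three-valued logic: None models SQL NULL.
def not(a: Option[Boolean]): Option[Boolean] = a.map(!_)

def eq3(a: Option[Boolean], b: Option[Boolean]): Option[Boolean] =
  for (x <- a; y <- b) yield x == y

val domain = Seq(Some(true), Some(false), None)

// Rule 1: Not(null) == null, so IsNull(Not(e)) is equivalent to IsNull(e).
assert(domain.forall(a => not(a).isEmpty == a.isEmpty))

// Rules 2 and 3: (Not(a) = b) == (a = Not(b)) and (a != b) == (a = Not(b)).
for (a <- domain; b <- domain) {
  assert(eq3(not(a), b) == eq3(a, not(b)))
  assert(not(eq3(a, b)) == eq3(a, not(b)))
}
{code}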
[jira] [Created] (SPARK-38132) Remove NotPropagation
Kazuyuki Tanimura created SPARK-38132:
--------------------------------------

             Summary: Remove NotPropagation
                 Key: SPARK-38132
                 URL: https://issues.apache.org/jira/browse/SPARK-38132
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Kazuyuki Tanimura

To mitigate the bug introduced by SPARK-36665, remove the {{NotPropagation}} optimization for now until we find a better approach. The {{NotPropagation}} optimization broke {{RewritePredicateSubquery}} so that it no longer properly rewrites the predicate to a NULL-aware left anti join.
[jira] [Commented] (SPARK-36665) Add more Not operator optimizations
[ https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487222#comment-17487222 ]

Kazuyuki Tanimura commented on SPARK-36665:
-------------------------------------------

Understood, thank you [~aokolnychyi]. I am preparing a fix. I am sorry for the inconvenience.
[jira] [Commented] (SPARK-36665) Add more Not operator optimizations
[ https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486848#comment-17486848 ]

Kazuyuki Tanimura commented on SPARK-36665:
-------------------------------------------

I saw the test case at https://github.com/apache/spark/pull/35395 [~aokolnychyi] [~viirya]. I will add a change to filter out the Not(InSubquery) case. I am also curious why RewritePredicateSubquery did not rewrite InSubquery to a semi join after the Not optimization. I will look into it tomorrow.
[jira] [Commented] (SPARK-36665) Add more Not operator optimizations
[ https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486829#comment-17486829 ]

Kazuyuki Tanimura commented on SPARK-36665:
-------------------------------------------

[~aokolnychyi] Thank you for bringing this up. I would like to make sure I understood the problem correctly, please bear with me.

The example you posted is
{quote}Not(Not(InSubquery(...)) <=> true)
{quote}
After the optimization, it is
{quote}Not(InSubquery(...) <=> false)
{quote}
When you mention that "{{RewritePredicateSubquery}} does not rewrite... This leads to a wrong query result", did you mean that the query is suboptimal, or that the output values are wrong?
[jira] [Commented] (SPARK-38086) Make ArrowColumnVector Extendable
[ https://issues.apache.org/jira/browse/SPARK-38086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485530#comment-17485530 ]

Kazuyuki Tanimura commented on SPARK-38086:
-------------------------------------------

I am working on this
[jira] [Created] (SPARK-38086) Make ArrowColumnVector Extendable
Kazuyuki Tanimura created SPARK-38086:
--------------------------------------

             Summary: Make ArrowColumnVector Extendable
                 Key: SPARK-38086
                 URL: https://issues.apache.org/jira/browse/SPARK-38086
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Kazuyuki Tanimura

Some Spark extension libraries need to extend ArrowColumnVector.java. For now, that is impossible, as the ArrowColumnVector class is final and the accessors are all private.

For example, Rapids copies the entire ArrowColumnVector class in order to work around the issue: https://github.com/NVIDIA/spark-rapids/blob/main/sql-plugin/src/main/java/org/apache/spark/sql/vectorized/rapids/AccessibleArrowColumnVector.java

Proposing to relax the private/final restrictions to make ArrowColumnVector extendable.
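Once `final` is relaxed and the accessors opened up, an extension library could subclass instead of copying the whole file; a hypothetical sketch (the subclass and its hook are invented for illustration, this does not compile against current Spark where the class is final):

{code:scala}
import org.apache.arrow.vector.ValueVector
import org.apache.spark.sql.vectorized.ArrowColumnVector

// Hypothetical once ArrowColumnVector is non-final: add behavior without
// duplicating the class, as spark-rapids currently has to do.
class TracingArrowColumnVector(vector: ValueVector)
    extends ArrowColumnVector(vector) {
  override def getInt(rowId: Int): Int = {
    val v = super.getInt(rowId)
    // extension-specific hook, e.g. metrics or debugging
    v
  }
}
{code}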
[jira] [Commented] (SPARK-35867) Enable vectorized read for VectorizedPlainValuesReader.readBooleans
[ https://issues.apache.org/jira/browse/SPARK-35867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17440813#comment-17440813 ] Kazuyuki Tanimura commented on SPARK-35867: --- I am working on this > Enable vectorized read for VectorizedPlainValuesReader.readBooleans > --- > > Key: SPARK-35867 > URL: https://issues.apache.org/jira/browse/SPARK-35867 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Minor > > Currently we decode PLAIN-encoded booleans as follows: > {code:java} > public final void readBooleans(int total, WritableColumnVector c, int > rowId) { > // TODO: properly vectorize this > for (int i = 0; i < total; i++) { > c.putBoolean(rowId + i, readBoolean()); > } > } > {code} > Ideally we should vectorize this. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
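Parquet's PLAIN encoding stores booleans bit-packed, eight per byte with the least-significant bit first, so a vectorized reader can fetch each byte once and unpack eight values from it instead of calling readBoolean() per value. A standalone sketch of the idea, with plain arrays standing in for the real Parquet input stream and WritableColumnVector:
{code:scala}
// Decode `total` PLAIN-encoded booleans from `packed` (8 values per byte,
// LSB first) into `out`. Arrays are stand-ins for the real buffer types.
def readBooleansVectorized(total: Int, packed: Array[Byte], out: Array[Boolean]): Unit = {
  var i = 0
  while (i < total) {
    val b = packed(i >> 3) & 0xFF    // fetch the byte once per 8 values
    val n = math.min(8, total - i)   // the last byte may be partially used
    var bit = 0
    while (bit < n) {
      out(i + bit) = ((b >> bit) & 1) == 1
      bit += 1
    }
    i += n
  }
}
{code}
The real reader additionally has to cope with reads that start mid-byte; that bookkeeping is omitted here.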
[jira] [Commented] (SPARK-36721) Simplify boolean equalities if one side is literal
[ https://issues.apache.org/jira/browse/SPARK-36721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17413404#comment-17413404 ] Kazuyuki Tanimura commented on SPARK-36721: --- I am working on this > Simplify boolean equalities if one side is literal > -- > > Key: SPARK-36721 > URL: https://issues.apache.org/jira/browse/SPARK-36721 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: Kazuyuki Tanimura >Priority: Major > > The following query does not push down the filter > ``` > SELECT * FROM t WHERE (a AND b) = true > ``` > although the following equivalent query pushes down the filter as expected. > ``` > SELECT * FROM t WHERE (a AND b) > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36721) Simplify boolean equalities if one side is literal
Kazuyuki Tanimura created SPARK-36721: - Summary: Simplify boolean equalities if one side is literal Key: SPARK-36721 URL: https://issues.apache.org/jira/browse/SPARK-36721 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.2, 3.2.0, 3.3.0 Reporter: Kazuyuki Tanimura The following query does not push down the filter ``` SELECT * FROM t WHERE (a AND b) = true ``` although the following equivalent query pushes down the filter as expected. ``` SELECT * FROM t WHERE (a AND b) ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
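Under SQL's three-valued logic, e = true evaluates exactly like e, and e = false exactly like NOT(e) (all are NULL when e is NULL), so the first query can be normalized to the second. A minimal sketch of such a simplification over Catalyst expressions (not Spark's actual rule; assumes spark-catalyst on the classpath):
{code:scala}
import org.apache.spark.sql.catalyst.expressions.{EqualTo, Expression, Literal, Not}
import org.apache.spark.sql.types.BooleanType

// Collapse comparisons of a boolean expression against a boolean literal so
// the remaining bare predicate matches the usual filter-pushdown patterns.
def simplifyBooleanEquality(e: Expression): Expression = e.transformUp {
  case EqualTo(left, Literal(true, BooleanType))   => left
  case EqualTo(left, Literal(false, BooleanType))  => Not(left)
  case EqualTo(Literal(true, BooleanType), right)  => right
  case EqualTo(Literal(false, BooleanType), right) => Not(right)
}
{code}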
[jira] [Commented] (SPARK-36665) Add more Not operator optimizations
[ https://issues.apache.org/jira/browse/SPARK-36665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17409746#comment-17409746 ] Kazuyuki Tanimura commented on SPARK-36665: --- I am working on this > Add more Not operator optimizations > --- > > Key: SPARK-36665 > URL: https://issues.apache.org/jira/browse/SPARK-36665 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0 >Reporter: Kazuyuki Tanimura >Priority: Major > > {{BooleanSimplification should be able to do more simplifications for Not > operators by applying the following rules}} > # {{Not(null) == null}} > ## {{e.g. IsNull(Not(...)) can be IsNull(...)}} > # {{(Not(a) = b) == (a = Not(b))}} > ## {{e.g. Not(...) = true can be (...) = false}} > # {{(a != b) == (a = Not(b))}} > ## {{e.g. (...) != true can be (...) = false}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36665) Add more Not operator optimizations
Kazuyuki Tanimura created SPARK-36665: - Summary: Add more Not operator optimizations Key: SPARK-36665 URL: https://issues.apache.org/jira/browse/SPARK-36665 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.2, 3.2.0, 3.3.0 Reporter: Kazuyuki Tanimura {{BooleanSimplification should be able to do more simplifications for Not operators by applying the following rules}} # {{Not(null) == null}} ## {{e.g. IsNull(Not(...)) can be IsNull(...)}} # {{(Not(a) = b) == (a = Not(b))}} ## {{e.g. Not(...) = true can be (...) = false}} # {{(a != b) == (a = Not(b))}} ## {{e.g. (...) != true can be (...) = false}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
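The three identities can be sanity-checked exhaustively under three-valued logic in a few lines of plain Scala, modeling a nullable BOOLEAN as Option[Boolean] with None standing for NULL:
{code:scala}
// None models SQL NULL; Some(x) a non-null boolean value.
def not3(a: Option[Boolean]): Option[Boolean] = a.map(!_)   // rule 1: Not(null) == null
def eq3(a: Option[Boolean], b: Option[Boolean]): Option[Boolean] =
  for (x <- a; y <- b) yield x == y                         // NULL-propagating '='
def neq3(a: Option[Boolean], b: Option[Boolean]): Option[Boolean] =
  for (x <- a; y <- b) yield x != y                         // NULL-propagating '!='

val domain = Seq(Some(true), Some(false), None)
for (a <- domain; b <- domain) {
  assert(eq3(not3(a), b) == eq3(a, not3(b)))  // rule 2: (Not(a) = b) == (a = Not(b))
  assert(neq3(a, b) == eq3(a, not3(b)))       // rule 3: (a != b) == (a = Not(b))
}
{code}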
[jira] [Created] (SPARK-36644) Push down boolean column filter
Kazuyuki Tanimura created SPARK-36644: - Summary: Push down boolean column filter Key: SPARK-36644 URL: https://issues.apache.org/jira/browse/SPARK-36644 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 3.1.2, 3.2.0 Reporter: Kazuyuki Tanimura The following query does not push down the filter ``` SELECT * FROM t WHERE boolean_field ``` although the following query pushes down the filter as expected. ``` SELECT * FROM t WHERE boolean_field = true ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36644) Push down boolean column filter
[ https://issues.apache.org/jira/browse/SPARK-36644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17408348#comment-17408348 ] Kazuyuki Tanimura commented on SPARK-36644: --- I am working on this issue > Push down boolean column filter > --- > > Key: SPARK-36644 > URL: https://issues.apache.org/jira/browse/SPARK-36644 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Kazuyuki Tanimura >Priority: Major > > The following query does not push down the filter > ``` > SELECT * FROM t WHERE boolean_field > ``` > although the following query pushes down the filter as expected. > ``` > SELECT * FROM t WHERE boolean_field = true > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
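One way to observe the difference is to compare the PushedFilters section of the two physical plans; a sketch against a local session (the path and generated data are made up for the example):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("pushdown-check").getOrCreate()
spark.range(100).selectExpr("id % 2 = 0 AS boolean_field")
  .write.mode("overwrite").parquet("/tmp/bool_pushdown_demo")

val df = spark.read.parquet("/tmp/bool_pushdown_demo")
// On affected versions, the first plan shows an empty PushedFilters list,
// while the second shows IsNotNull/EqualTo filters pushed to the scan.
df.where("boolean_field").explain()
df.where("boolean_field = true").explain()
{code}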
[jira] [Created] (SPARK-36607) Support BooleanType in UnwrapCastInBinaryComparison
Kazuyuki Tanimura created SPARK-36607: - Summary: Support BooleanType in UnwrapCastInBinaryComparison Key: SPARK-36607 URL: https://issues.apache.org/jira/browse/SPARK-36607 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.2, 3.2.0, 3.3.0 Reporter: Kazuyuki Tanimura Enhancing the previous work from SPARK-24994 and SPARK-32858 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
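For BooleanType, the unwrap turns a comparison on a cast column back into one on the raw boolean column, e.g. CAST(bool_col AS INT) = 1 into bool_col = true, so the predicate can be pushed down. A minimal sketch of that shape (an illustration of the idea, not the actual UnwrapCastInBinaryComparison code):
{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Cast, EqualTo, Expression, Literal}
import org.apache.spark.sql.types.{BooleanType, IntegerType}

// CAST(b AS INT) = 1  ==>  b = true ;  CAST(b AS INT) = 0  ==>  b = false.
// Integer literals other than 0/1 are left alone here; the real rule can
// constant-fold those cases instead.
def unwrapBooleanCast(e: Expression): Expression = e.transformUp {
  case EqualTo(c: Cast, Literal(v: Int, IntegerType))
      if c.child.dataType == BooleanType && (v == 0 || v == 1) =>
    EqualTo(c.child, Literal(v == 1))
}
{code}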
[jira] [Updated] (SPARK-32210) Failed to serialize large MapStatuses
[ https://issues.apache.org/jira/browse/SPARK-32210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuyuki Tanimura updated SPARK-32210: -- Affects Version/s: 3.3.0 2.4.8 3.0.3 > Failed to serialize large MapStatuses > - > > Key: SPARK-32210 > URL: https://issues.apache.org/jira/browse/SPARK-32210 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.4, 2.4.8, 3.0.3, 3.1.2, 3.3.0 >Reporter: Yuming Wang >Priority: Major > > Driver side exception: > {noformat} > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-3] > spark.MapOutputTrackerMaster:91 : > java.lang.NegativeArraySizeException > at > org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322) > at > org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) > at > org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222) > at > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-5] > spark.MapOutputTrackerMaster:91 : > java.lang.NegativeArraySizeException > at > org.apache.commons.io.output.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:322) > at > org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:984) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply$mcV$sp(MapOutputTracker.scala:228) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus$$anonfun$serializedMapStatus$2.apply(MapOutputTracker.scala:222) > at > org.apache.spark.ShuffleStatus.withWriteLock(MapOutputTracker.scala:72) > at > org.apache.spark.ShuffleStatus.serializedMapStatus(MapOutputTracker.scala:222) > at > org.apache.spark.MapOutputTrackerMaster$MessageLoop.run(MapOutputTracker.scala:493) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 20/07/07 02:22:26,366 ERROR [map-output-dispatcher-2] > spark.MapOutputTrackerMaster:91 : > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
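The NegativeArraySizeException here is the usual signature of a byte count overflowing an Int: once more than Int.MaxValue bytes are buffered, the wrapped count goes negative and allocating the output array throws. A two-line illustration:
{code:scala}
val written: Long = Int.MaxValue.toLong + 1   // one byte more than an Int can count
val size: Int = written.toInt                 // wraps to -2147483648
// new Array[Byte](size)                      // java.lang.NegativeArraySizeException
{code}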
[jira] [Updated] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
[ https://issues.apache.org/jira/browse/SPARK-36464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuyuki Tanimura updated SPARK-36464: -- Description: The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `Int`. That causes an overflow and returns a negative size when over 2GB data is written into `ChunkedByteBufferOutputStream` was: The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `Int`. That causes an overflow and returns a negative size when over 2GB data is written into `ChunkedByteBufferOutputStream` build/sbt "core/testOnly *ChunkedByteBufferOutputStreamSuite -- -z SPARK-36464" > Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream > for Writing Over 2GB Data > -- > > Key: SPARK-36464 > URL: https://issues.apache.org/jira/browse/SPARK-36464 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kazuyuki Tanimura >Priority: Major > > The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; > however, the underlying `_size` variable is initialized as `Int`. > That causes an overflow and returns a negative size when over 2GB data is > written into `ChunkedByteBufferOutputStream` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
[ https://issues.apache.org/jira/browse/SPARK-36464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuyuki Tanimura updated SPARK-36464: -- Description: The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `Int`. That causes an overflow and returns a negative size when over 2GB data is written into `ChunkedByteBufferOutputStream` build/sbt "core/testOnly *ChunkedByteBufferOutputStreamSuite -- -z SPARK-36464" was: The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `Int`. That causes an overflow and returns a negative size when over 2GB data is written into `ChunkedByteBufferOutputStream` > Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream > for Writing Over 2GB Data > -- > > Key: SPARK-36464 > URL: https://issues.apache.org/jira/browse/SPARK-36464 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kazuyuki Tanimura >Priority: Major > > The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; > however, the underlying `_size` variable is initialized as `Int`. > That causes an overflow and returns a negative size when over 2GB data is > written into `ChunkedByteBufferOutputStream` > > build/sbt "core/testOnly *ChunkedByteBufferOutputStreamSuite -- -z > SPARK-36464" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
[ https://issues.apache.org/jira/browse/SPARK-36464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuyuki Tanimura updated SPARK-36464: -- Description: The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `Int`. That causes an overflow and returns a negative size when over 2GB data is written into `ChunkedByteBufferOutputStream` was: The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `int`. That causes an overflow and returns negative size when over 2GB data is written into `ChunkedByteBufferOutputStream` > Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream > for Writing Over 2GB Data > -- > > Key: SPARK-36464 > URL: https://issues.apache.org/jira/browse/SPARK-36464 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kazuyuki Tanimura >Priority: Major > > The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; > however, the underlying `_size` variable is initialized as `Int`. > That causes an overflow and returns a negative size when over 2GB data is > written into `ChunkedByteBufferOutputStream` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
Kazuyuki Tanimura created SPARK-36464: - Summary: Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data Key: SPARK-36464 URL: https://issues.apache.org/jira/browse/SPARK-36464 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.2 Reporter: Kazuyuki Tanimura The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `int`. That causes an overflow and returns negative size when over 2GB data is written into `ChunkedByteBufferOutputStream` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
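A minimal sketch of the described fix; the field and method names mirror the description, while the real class is org.apache.spark.util.io.ChunkedByteBufferOutputStream:
{code:scala}
// Keep the running byte count in a Long so writes past 2GB cannot wrap
// negative. Previously: private var _size: Int = 0
class ChunkedByteBufferOutputStreamSketch {
  private var _size: Long = 0L
  def write(numBytes: Int): Unit = { _size += numBytes }
  def size: Long = _size
}
{code}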