[jira] [Updated] (SPARK-47120) Null comparison push down data filter from subquery produces NPE in Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-47120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Dumitru updated SPARK-47120:
---

Description:

This issue was introduced in [https://github.com/apache/spark/pull/41088], where we convert scalar subqueries to literals and then convert the literals to {{org.apache.spark.sql.sources.Filters}}. These filters are then pushed down to Parquet. If the literal is a comparison with {{null}}, the Parquet filter conversion code throws an NPE.

Repro code which results in the NPE:
{code:java}
create table t1(d date) using parquet
create table t2(d date) using parquet
insert into t1 values date'2021-01-01'
insert into t2 values (null)
select * from t1 where 1=1 and d > (select d from t2){code}

[fix PR|https://github.com/apache/spark/pull/45202/files]

> Null comparison push down data filter from subquery produces NPE in Parquet filter
>
>                 Key: SPARK-47120
>                 URL: https://issues.apache.org/jira/browse/SPARK-47120
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.5.0
>            Reporter: Cosmin Dumitru
>            Priority: Major
>              Labels: pull-request-available
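For reference, the shape of the problematic push-down can be sketched as follows. This is illustrative only: the literal folding happens inside Spark's planner, and canPushDown is a hypothetical helper, not the actual change made in the fix PR.

{code:scala}
import org.apache.spark.sql.sources.{Filter, GreaterThan}

// After https://github.com/apache/spark/pull/41088, the scalar subquery in
//   select * from t1 where 1=1 and d > (select d from t2)
// is evaluated to a literal. Because t2 contains only NULL, the data-source
// filter handed to the Parquet push-down path is roughly equivalent to:
val pushed: Filter = GreaterThan("d", null)

// Converting this filter dereferences the null comparison value and throws
// the NPE. A null-safe guard of roughly this shape (hypothetical, not
// necessarily how the fix PR does it) would skip such filters instead:
def canPushDown(filter: Filter): Boolean = filter match {
  case GreaterThan(_, value) => value != null
  case _                     => true // other filter types elided in this sketch
}
{code}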
[jira] [Updated] (SPARK-47120) Null comparison push down data filter from subquery produces NPE in Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-47120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Dumitru updated SPARK-47120:
---

Description:

This issue was introduced in [https://github.com/apache/spark/pull/41088], where we convert scalar subqueries to literals and then convert the literals to {{org.apache.spark.sql.sources.Filters}}. These filters are then pushed down to Parquet. If the literal is a comparison with {{null}}, the Parquet filter conversion code throws an NPE.

Repro code which results in the NPE:
{code:java}
create table t1(d date) using parquet
create table t2(d date) using parquet
insert into t1 values date'2021-01-01'
insert into t2 values (null)
select * from t1 where 1=1 and d > (select d from t2){code}

I'll provide a fix PR shortly.
[jira] [Updated] (SPARK-47120) Null comparison push down data filter from subquery produces NPE in Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-47120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Dumitru updated SPARK-47120:
---

Description:

This issue was introduced in [https://github.com/apache/spark/pull/41088], where we convert scalar subqueries to literals and then convert the literals to {{org.apache.spark.sql.sources.Filters}}. These filters are then pushed down to Parquet. If the literal is a comparison with {{null}}, the Parquet filter conversion code throws an NPE.

Repro code which results in the NPE:
{code:java}
create table t1(d date) using parquet
create table t2(d date) using parquet
insert into t1 values date'2021-01-01'
insert into t2 values (null)
select * from t1 where 1=1 and d > (select d from t2){code}
[jira] [Updated] (SPARK-46056) Vectorized parquet reader throws NPE when reading files with DecimalType default values
[ https://issues.apache.org/jira/browse/SPARK-46056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Dumitru updated SPARK-46056:
---

Labels: pull-request-available  (was: )

> Vectorized parquet reader throws NPE when reading files with DecimalType default values
>
>                 Key: SPARK-46056
>                 URL: https://issues.apache.org/jira/browse/SPARK-46056
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.4.0, 3.5.0, 4.0.0
>            Reporter: Cosmin Dumitru
>            Priority: Major
>              Labels: pull-request-available
>
> The scenario is a bit more complicated than what the title says, but it's not that far-fetched.
> # Write a parquet file with one column.
> # Evolve the schema and add a new column with a DecimalType wide enough that it doesn't fit in a long, and with a default value.
> # Try to read the file with the new schema.
> # NPE.
>
> The issue lies in how the column vector stores DecimalTypes. It incorrectly assumes that they fit in a long and tries to write them to the associated long array.
> https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L724
>
> In OnHeapColumnVector, which extends WritableColumnVector, reserveInternal() checks if the type is too wide and initializes the array elements:
> https://github.com/apache/spark/blob/b568ba43f0dd80130bca1bf86c48d0d359e57883/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java#L568
>
> isArray() returns true if the type is byteArrayDecimalType:
> https://github.com/apache/spark/blob/afebf8e6c9f24d264580d084cb12e3e6af120a5a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L945
>
> Without the fix:
> {code:java}
> [info] Cause: java.lang.NullPointerException:
> [info]   at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLongs(OnHeapColumnVector.java:370)
> [info]   at org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendLongs(WritableColumnVector.java:611)
> [info]   at org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendObjects(WritableColumnVector.java:745)
> [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetColumnVector.<init>(ParquetColumnVector.java:95)
> [info]   at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:286)
> [info]   at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:306)
> [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:293)
> [info]   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:218)
> [info]   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:280)
> [info]   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
> [info]   at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
> [info]   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
> [info]   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
> [info]   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> [info]   at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> [info]   at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
> [info]   at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891){code}
>
> fix PR: https://github.com/apache/spark/pull/43960
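For reference, the scenario can be reproduced end to end with a few SQL statements. The sketch below is illustrative only: the table and column names are invented, it assumes default column values are enabled (the default on Spark 3.4+), and it is not the regression test added by the fix PR.

{code:scala}
// Reproduction sketch of the schema-evolution scenario, assuming a running
// SparkSession named `spark` (Spark 3.4+ with default column values enabled).
spark.sql("CREATE TABLE t (i INT) USING parquet")
spark.sql("INSERT INTO t VALUES (1)")  // the existing file only has column `i`

// Precision 25 exceeds the 18 digits that fit in a long, so the vectorized
// reader has to back this column with byte-array storage rather than longs.
spark.sql("ALTER TABLE t ADD COLUMNS (d DECIMAL(25, 5) DEFAULT 123.45)")

// Reading the old file has to fill in the default value for `d`; without the
// fix this path ends in the NPE from OnHeapColumnVector.putLongs shown above.
spark.sql("SELECT * FROM t").show()
{code}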
[jira] [Updated] (SPARK-46056) Vectorized parquet reader throws NPE when reading files with DecimalType default values
[ https://issues.apache.org/jira/browse/SPARK-46056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Dumitru updated SPARK-46056:
---

Affects Version/s: 4.0.0

> Vectorized parquet reader throws NPE when reading files with DecimalType default values
>
>                 Key: SPARK-46056
>                 URL: https://issues.apache.org/jira/browse/SPARK-46056
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.4.0, 3.5.0, 4.0.0
>            Reporter: Cosmin Dumitru
>            Priority: Major
[jira] [Updated] (SPARK-46056) Vectorized parquet reader throws NPE when reading files with DecimalType default values
[ https://issues.apache.org/jira/browse/SPARK-46056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Dumitru updated SPARK-46056:
---

Description:

The scenario is a bit more complicated than what the title says, but it's not that far-fetched.
# Write a parquet file with one column.
# Evolve the schema and add a new column with a DecimalType wide enough that it doesn't fit in a long, and with a default value.
# Try to read the file with the new schema.
# NPE.

The issue lies in how the column vector stores DecimalTypes. It incorrectly assumes that they fit in a long and tries to write them to the associated long array.
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L724

In OnHeapColumnVector, which extends WritableColumnVector, reserveInternal() checks if the type is too wide and initializes the array elements:
https://github.com/apache/spark/blob/b568ba43f0dd80130bca1bf86c48d0d359e57883/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java#L568

isArray() returns true if the type is byteArrayDecimalType:
https://github.com/apache/spark/blob/afebf8e6c9f24d264580d084cb12e3e6af120a5a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L945

Without the fix:
{code:java}
[info] Cause: java.lang.NullPointerException:
[info]   at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLongs(OnHeapColumnVector.java:370)
[info]   at org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendLongs(WritableColumnVector.java:611)
[info]   at org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendObjects(WritableColumnVector.java:745)
[info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetColumnVector.<init>(ParquetColumnVector.java:95)
[info]   at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:286)
[info]   at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:306)
[info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:293)
[info]   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:218)
[info]   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:280)
[info]   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
[info]   at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
[info]   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
[info]   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
[info]   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[info]   at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
[info]   at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
[info]   at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891){code}

fix PR: [https://github.com/apache/spark/pull/43960]
[jira] [Updated] (SPARK-46056) Vectorized parquet reader throws NPE when reading files with DecimalType default values
[ https://issues.apache.org/jira/browse/SPARK-46056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Dumitru updated SPARK-46056:
---

Description:

The scenario is a bit more complicated than what the title says, but it's not that far-fetched.
# Write a parquet file with one column.
# Evolve the schema and add a new column with a DecimalType wide enough that it doesn't fit in a long, and with a default value.
# Try to read the file with the new schema.
# NPE.

The issue lies in how the column vector stores DecimalTypes. It incorrectly assumes that they fit in a long and tries to write them to the associated long array.
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L724

Without the fix:
{code:java}
[info] Cause: java.lang.NullPointerException:
[info]   at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLongs(OnHeapColumnVector.java:370)
[info]   at org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendLongs(WritableColumnVector.java:611)
[info]   at org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendObjects(WritableColumnVector.java:745)
[info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetColumnVector.<init>(ParquetColumnVector.java:95)
[info]   at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:286)
[info]   at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:306)
[info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:293)
[info]   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:218)
[info]   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:280)
[info]   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
[info]   at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
[info]   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
[info]   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
[info]   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[info]   at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
[info]   at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
[info]   at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891){code}

fix PR: https://github.com/apache/spark/pull/43960
[jira] [Updated] (SPARK-46056) Vectorized parquet reader throws NPE when reading files with DecimalType default values
[ https://issues.apache.org/jira/browse/SPARK-46056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Dumitru updated SPARK-46056:
---

Description:

The scenario is a bit more complicated than what the title says, but it's not that far-fetched.
# Write a parquet file with one column.
# Evolve the schema and add a new column with a DecimalType wide enough that it doesn't fit in a long, and with a default value.
# Try to read the file with the new schema.
# NPE.

The issue lies in how the column vector stores DecimalTypes. It incorrectly assumes that they fit in a long and tries to write them to the associated long array.
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L724

Without the fix:
{code:java}
[info] Cause: java.lang.NullPointerException:
[info]   at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLongs(OnHeapColumnVector.java:370)
[info]   at org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendLongs(WritableColumnVector.java:611)
[info]   at org.apache.spark.sql.execution.vectorized.WritableColumnVector.appendObjects(WritableColumnVector.java:745)
[info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetColumnVector.<init>(ParquetColumnVector.java:95)
[info]   at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:286)
[info]   at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:306)
[info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:293)
[info]   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:218)
[info]   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:280)
[info]   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:130)
[info]   at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
[info]   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
[info]   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
[info]   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[info]   at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
[info]   at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
[info]   at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:891){code}
[jira] [Updated] (SPARK-46056) Vectorized parquet reader throws NPE when reading files with DecimalType default values
[ https://issues.apache.org/jira/browse/SPARK-46056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Dumitru updated SPARK-46056:
---

Description:

The scenario is a bit more complicated than what the title says, but it's not that far-fetched.
# Write a parquet file with one column.
# Evolve the schema and add a new column with a DecimalType wide enough that it doesn't fit in a long, and with a default value.
# Try to read the file with the new schema.
# NPE.

The issue lies in how the column vector stores DecimalTypes. It incorrectly assumes that they fit in a long and tries to write them to the associated long array.
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L724
[jira] [Created] (SPARK-46056) Vectorized parquet reader throws NPE when reading files with DecimalType default values
Cosmin Dumitru created SPARK-46056:
--

             Summary: Vectorized parquet reader throws NPE when reading files with DecimalType default values
                 Key: SPARK-46056
                 URL: https://issues.apache.org/jira/browse/SPARK-46056
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.5.0, 3.4.0
            Reporter: Cosmin Dumitru

The scenario is a bit more complicated than what the title says, but it's not that far-fetched.
# Write a parquet file with one column.
# Evolve the schema and add a new column with a DecimalType wide enough that it doesn't fit in a long, and with a default value.
# Try to read the file with the new schema.
# NPE.

The issue lies in how the column vector stores DecimalTypes. It incorrectly assumes that they fit in a long and tries to write them to the associated long array.