[GitHub] [spark] wangyum commented on pull request #30517: [DO-NOT-MERGE][test-maven] Test compatibility for Parquet 1.11.1, Avro 1.10.1 and Hive 2.3.8
wangyum commented on pull request #30517: URL: https://github.com/apache/spark/pull/30517#issuecomment-739699628 cc @gszadovszky @shangxinli The problem is that DataSource v2 will push down the partition filter to [`SpecificParquetRecordReaderBase`](https://github.com/apache/spark/blob/26badc4cc4ea6d68c8c5d50cf2c83e4904aacc0d/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L138-L153) or [`InternalParquetRecordReader`](https://github.com/apache/parquet-mr/blob/86808425d5f67ddb2b4c80b1d1a06015bc92be10/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L185). If we use `reader.getFilteredRecordCount()`, the result is empty. Do you think it should be fixed on the Parquet side or Spark side? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
wangyum commented on pull request #30517: URL: https://github.com/apache/spark/pull/30517#issuecomment-739688414

@dongjoon-hyun It's a bug in DataSource v2. `VectorizedParquetRecordReader` also has this issue if we change [these lines](https://github.com/apache/spark/blob/26badc4cc4ea6d68c8c5d50cf2c83e4904aacc0d/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L149-L152) to:

```java
this.totalRowCount += reader.getFilteredRecordCount();
```
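To make the mismatch concrete, here is a small self-contained sketch of the two row-count strategies being compared. The class and method names are illustrative only (this is not Spark's or parquet-mr's actual code): `unfilteredRowCount` models summing per-row-group counts as `SpecificParquetRecordReaderBase` does today, while `filteredRowCount` models what `reader.getFilteredRecordCount()` would effectively return when the pushed-down partition filter references a column absent from the file.

```java
// Toy model of the two row-count strategies; names are illustrative,
// not Spark's actual API.
public class RowCountSketch {

    // What SpecificParquetRecordReaderBase does today: sum the row counts
    // of all row groups, ignoring any pushed-down filter.
    static long unfilteredRowCount(long[] blockRowCounts) {
        long total = 0L;
        for (long rows : blockRowCounts) {
            total += rows;
        }
        return total;
    }

    // What getFilteredRecordCount() would effectively return when the
    // pushed-down filter references a column that does not exist in the
    // file: every row group is filtered out, so the total collapses to 0.
    static long filteredRowCount(long[] blockRowCounts, boolean filterOnMissingColumn) {
        return filterOnMissingColumn ? 0L : unfilteredRowCount(blockRowCounts);
    }

    public static void main(String[] args) {
        long[] blocks = {100L, 200L, 300L};
        System.out.println(unfilteredRowCount(blocks));     // 600
        System.out.println(filteredRowCount(blocks, true)); // 0
    }
}
```

This is why switching the accumulation to `getFilteredRecordCount()` surfaces the same empty-result symptom in the vectorized path.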
wangyum commented on pull request #30517: URL: https://github.com/apache/spark/pull/30517#issuecomment-739509750 retest this please.
wangyum commented on pull request #30517: URL: https://github.com/apache/spark/pull/30517#issuecomment-739500874 retest this please.
wangyum commented on pull request #30517: URL: https://github.com/apache/spark/pull/30517#issuecomment-739244254

For the `Spark vectorized reader - with partition data column ...` cases: they failed because DataSource v2 pushes the partition filter down to `InternalParquetRecordReader`, and Parquet returns empty results for a non-existent column since PARQUET-1765.
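The behavior described above can be modeled in a few lines. This is a self-contained sketch, not parquet-mr code; the column names and the `FILE_COLUMNS` set are made up for illustration. The key point is that a partition column lives in the directory path rather than in the Parquet file, so a filter on it references a column the file does not contain, and (per the PARQUET-1765 behavior cited above) such a filter matches no rows.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative model of filtering on a column that is missing from the file.
public class MissingColumnFilterSketch {

    // Columns physically stored in the Parquet file. A partition column
    // such as "pcol" lives in the directory path, not in the file itself.
    static final Set<String> FILE_COLUMNS = new HashSet<>(Arrays.asList("id", "value"));

    // Per the comment above, since PARQUET-1765 a predicate on a column the
    // file does not contain matches no rows, so the filtered count is 0.
    static long filteredRecordCount(String filterColumn, long rowsInFile) {
        if (!FILE_COLUMNS.contains(filterColumn)) {
            return 0L; // column absent from the file: nothing survives the filter
        }
        return rowsInFile; // a real reader would evaluate row-group statistics here
    }

    public static void main(String[] args) {
        // Filter on a partition column: absent from the file, so 0 rows.
        System.out.println(filteredRecordCount("pcol", 1000L)); // 0
        // Filter on a physical column: rows can match.
        System.out.println(filteredRecordCount("id", 1000L));   // 1000
    }
}
```

Under this model, any query that pushes a partition-column predicate into the file-level reader silently drops every row, which matches the `with partition data column` test failures.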
wangyum commented on pull request #30517: URL: https://github.com/apache/spark/pull/30517#issuecomment-738576908

Failed test cases since upgrading Parquet to 1.11.1 @viirya :
```
- Spark vectorized reader - with partition data column - select a single complex field array and its parent struct array *** FAILED ***
- Non-vectorized reader - with partition data column - select a single complex field array and its parent struct array *** FAILED ***
- Spark vectorized reader - with partition data column - select a single complex field from a map entry and its parent map entry *** FAILED ***
- Non-vectorized reader - with partition data column - select a single complex field from a map entry and its parent map entry *** FAILED ***
- Spark vectorized reader - with partition data column - partial schema intersection - select missing subfield *** FAILED ***
- Non-vectorized reader - with partition data column - partial schema intersection - select missing subfield *** FAILED ***
- Spark vectorized reader - with partition data column - no unnecessary schema pruning *** FAILED ***
- Non-vectorized reader - with partition data column - no unnecessary schema pruning *** FAILED ***
- Spark vectorized reader - with partition data column - empty schema intersection *** FAILED ***
- Non-vectorized reader - with partition data column - empty schema intersection *** FAILED ***
- Spark vectorized reader - with partition data column - select a single complex field and in where clause *** FAILED ***
- Non-vectorized reader - with partition data column - select a single complex field and in where clause *** FAILED ***
- Spark vectorized reader - with partition data column - select nullable complex field and having is not null predicate *** FAILED ***
- Non-vectorized reader - with partition data column - select nullable complex field and having is not null predicate *** FAILED ***
- sql => parquet: map - non-standard *** FAILED ***
- sql => parquet: map - group type key *** FAILED ***
- sql => parquet: deeply nested type - non-standard *** FAILED ***
- sql => parquet: Backwards-compatibility: MAP with non-nullable value type - 2 - prior to 1.4.x *** FAILED ***
- sql => parquet: Backwards-compatibility: MAP with nullable value type - 3 - prior to 1.4.x *** FAILED ***
- DataFrame reuse *** FAILED ***
*** 20 TESTS FAILED ***
```