[GitHub] [spark] wangyum commented on pull request #30517: [DO-NOT-MERGE][test-maven] Test compatibility for Parquet 1.11.1, Avro 1.10.1 and Hive 2.3.8

2020-12-06 Thread GitBox


wangyum commented on pull request #30517:
URL: https://github.com/apache/spark/pull/30517#issuecomment-739699628


   cc @gszadovszky @shangxinli The problem is that DataSource V2 pushes the 
partition filter down to 
[`SpecificParquetRecordReaderBase`](https://github.com/apache/spark/blob/26badc4cc4ea6d68c8c5d50cf2c83e4904aacc0d/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L138-L153)
 or 
[`InternalParquetRecordReader`](https://github.com/apache/parquet-mr/blob/86808425d5f67ddb2b4c80b1d1a06015bc92be10/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L185).
 If we use `reader.getFilteredRecordCount()`, the result is empty. Do you think 
this should be fixed on the Parquet side or on the Spark side?
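
   For reference, one shape a Spark-side fix could take is to keep predicates 
on partition columns out of the set handed to the Parquet reader. The sketch 
below is hypothetical: `Predicate`, `pushableToParquet`, and the column names 
are illustrative only and are not Spark or parquet-mr APIs.

   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.HashSet;
   import java.util.List;
   import java.util.Set;

   // Hypothetical illustration; none of these names exist in Spark or parquet-mr.
   final class PushdownSplitSketch {

     /** Hypothetical predicate that references exactly one column. */
     static final class Predicate {
       final String column;
       final String expression;

       Predicate(String column, String expression) {
         this.column = column;
         this.expression = expression;
       }
     }

     /** Keeps only the predicates whose column is physically present in the file. */
     static List<Predicate> pushableToParquet(List<Predicate> predicates, Set<String> fileColumns) {
       List<Predicate> pushable = new ArrayList<>();
       for (Predicate p : predicates) {
         if (fileColumns.contains(p.column)) {
           pushable.add(p);
         }
       }
       return pushable;
     }

     public static void main(String[] args) {
       Set<String> fileColumns = new HashSet<>(Arrays.asList("id", "name"));
       List<Predicate> predicates = Arrays.asList(
           new Predicate("p", "p = 1"),      // partition column, not stored in the file
           new Predicate("id", "id > 10"));  // data column

       for (Predicate p : pushableToParquet(predicates, fileColumns)) {
         System.out.println(p.expression); // id > 10
       }
     }
   }
   ```

   This sketch simply does not pass the partition predicate to the reader. It 
assumes that predicate is enforced elsewhere, for example by partition pruning.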
   
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org




2020-12-06 Thread GitBox


wangyum commented on pull request #30517:
URL: https://github.com/apache/spark/pull/30517#issuecomment-739688414


   @dongjoon-hyun It's a bug in DataSource V2. `VectorizedParquetRecordReader` 
also has this issue if we change [these 
lines](https://github.com/apache/spark/blob/26badc4cc4ea6d68c8c5d50cf2c83e4904aacc0d/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java#L149-L152)
 to:
   ```java
   this.totalRowCount += reader.getFilteredRecordCount();
   ```
   







2020-12-06 Thread GitBox


wangyum commented on pull request #30517:
URL: https://github.com/apache/spark/pull/30517#issuecomment-739509750


   retest this please.







2020-12-06 Thread GitBox


wangyum commented on pull request #30517:
URL: https://github.com/apache/spark/pull/30517#issuecomment-739500874


   retest this please.







2020-12-05 Thread GitBox


wangyum commented on pull request #30517:
URL: https://github.com/apache/spark/pull/30517#issuecomment-739244254


   The `Spark vectorized reader - with partition data column ...` cases failed 
because DataSource V2 pushes the partition filter down to 
`InternalParquetRecordReader`, and Parquet has returned empty results for a 
non-existent column since PARQUET-1765.
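
   To make the failure mode concrete, here is a minimal, self-contained sketch 
of the behaviour described above. It is a toy model, not the parquet-mr API: 
`RowGroup`, `EqFilter`, and `filteredRowCount` are hypothetical names, and the 
rule "a column missing from the file is treated as all-null, so an equality 
filter on a non-null value matches no rows" is an assumption about how the 
empty result arises.

   ```java
   import java.util.Arrays;
   import java.util.HashSet;
   import java.util.List;
   import java.util.Set;

   // Toy model only: these classes are hypothetical and are NOT parquet-mr types.
   final class MissingColumnFilterSketch {

     /** Hypothetical equality predicate: column == value. */
     static final class EqFilter {
       final String column;
       final Object value;

       EqFilter(String column, Object value) {
         this.column = column;
         this.value = value;
       }
     }

     /** Hypothetical stand-in for a row group: the file's column names and a row count. */
     static final class RowGroup {
       final Set<String> columns;
       final long rowCount;

       RowGroup(long rowCount, String... columns) {
         this.rowCount = rowCount;
         this.columns = new HashSet<>(Arrays.asList(columns));
       }
     }

     /** Row count ignoring any filter. */
     static long totalRowCount(List<RowGroup> groups) {
       long total = 0;
       for (RowGroup g : groups) {
         total += g.rowCount;
       }
       return total;
     }

     /**
      * Row count after filtering. Assumed rule: a column absent from the file is
      * treated as all-null, so "column == nonNullValue" matches no rows. Row groups
      * that contain the column are kept whole because this model has no statistics.
      */
     static long filteredRowCount(List<RowGroup> groups, EqFilter filter) {
       long total = 0;
       for (RowGroup g : groups) {
         boolean columnMissing = !g.columns.contains(filter.column);
         if (columnMissing && filter.value != null) {
           continue; // every row of this row group is dropped
         }
         total += g.rowCount;
       }
       return total;
     }

     public static void main(String[] args) {
       // The partition column `p` lives in the directory path, not in the file.
       List<RowGroup> groups = Arrays.asList(
           new RowGroup(100, "id", "name"),
           new RowGroup(50, "id", "name"));
       EqFilter partitionFilter = new EqFilter("p", 1);

       System.out.println(totalRowCount(groups));                     // 150
       System.out.println(filteredRowCount(groups, partitionFilter)); // 0
     }
   }
   ```

   Under that assumption, a count based on the filtered rows drops to zero as 
soon as the pushed-down filter names a partition column, even though the 
unfiltered count is non-zero.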







2020-12-03 Thread GitBox


wangyum commented on pull request #30517:
URL: https://github.com/apache/spark/pull/30517#issuecomment-738576908


   Test cases that have failed since upgrading Parquet to 1.11.1, @viirya:
   ```
   - Spark vectorized reader - with partition data column - select a single complex field array and its parent struct array *** FAILED ***
   - Non-vectorized reader - with partition data column - select a single complex field array and its parent struct array *** FAILED ***
   - Spark vectorized reader - with partition data column - select a single complex field from a map entry and its parent map entry *** FAILED ***
   - Non-vectorized reader - with partition data column - select a single complex field from a map entry and its parent map entry *** FAILED ***
   - Spark vectorized reader - with partition data column - partial schema intersection - select missing subfield *** FAILED ***
   - Non-vectorized reader - with partition data column - partial schema intersection - select missing subfield *** FAILED ***
   - Spark vectorized reader - with partition data column - no unnecessary schema pruning *** FAILED ***
   - Non-vectorized reader - with partition data column - no unnecessary schema pruning *** FAILED ***
   - Spark vectorized reader - with partition data column - empty schema intersection *** FAILED ***
   - Non-vectorized reader - with partition data column - empty schema intersection *** FAILED ***
   - Spark vectorized reader - with partition data column - select a single complex field and in where clause *** FAILED ***
   - Non-vectorized reader - with partition data column - select a single complex field and in where clause *** FAILED ***
   - Spark vectorized reader - with partition data column - select nullable complex field and having is not null predicate *** FAILED ***
   - Non-vectorized reader - with partition data column - select nullable complex field and having is not null predicate *** FAILED ***
   - sql => parquet: map - non-standard *** FAILED ***
   - sql => parquet: map - group type key *** FAILED ***
   - sql => parquet: deeply nested type - non-standard *** FAILED ***
   - sql => parquet: Backwards-compatibility: MAP with non-nullable value type - 2 - prior to 1.4.x *** FAILED ***
   - sql => parquet: Backwards-compatibility: MAP with nullable value type - 3 - prior to 1.4.x *** FAILED ***
   - DataFrame reuse *** FAILED ***
   *** 20 TESTS FAILED ***
   ```


