[GitHub] [hudi] umehrot2 commented on pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
umehrot2 commented on pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#issuecomment-810659252

@pengzhiwei2018 please also fix the commit message. We don't need the whole history in the commit message, like:
```
add test case
remove used imports
...
```
Also, I feel the commit message should be: `Implement Spark's FileIndex for Hudi to support queries via Hudi DataSource using non-globbed table path and partition pruning`

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] umehrot2 commented on pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
umehrot2 commented on pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#issuecomment-808606053

@pengzhiwei2018 I was testing Hudi without this patch via Spark SQL and I am a little confused. With Spark SQL, I see that partition pruning already works seamlessly for Hudi. Just start Spark SQL with:
```
spark-sql \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.hadoop.mapreduce.input.pathFilter.class=org.apache.hudi.hadoop.HoodieROTablePathFilter" \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar
```
Spark is able to get the partition schema from the catalog using `CatalogFileIndex` and do the partition pruning. So is the partition pruning support we are adding here meant for datasource-based queries? I think pruning should already have worked via the Spark datasource for hive-style partitioned tables too, because Spark tries to identify partition columns from the path, but I am not sure why it does not work.
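For reference, identifying partition columns from a hive-style path amounts to reading the `column=value` directory segments between the table root and the file name. Below is a minimal Python sketch of that idea. The helper names `parse_hive_partitions` and `prune`, and the paths in the usage note, are hypothetical and made up for illustration. This is not Spark's actual partition discovery code and it does no type inference.

```python
from urllib.parse import unquote


def parse_hive_partitions(table_path, file_path):
    """Return {column: value} read from hive-style directory segments.

    Hypothetical helper for illustration only; not Spark's implementation.
    """
    base = table_path.rstrip("/") + "/"
    if not file_path.startswith(base):
        raise ValueError("file is not under the table path")
    # Directory segments between the table root and the file name.
    segments = file_path[len(base):].split("/")[:-1]
    partitions = {}
    for segment in segments:
        column, sep, value = segment.partition("=")
        if not sep or not column:
            # A segment without "column=" carries no column name,
            # so nothing can be inferred for this layout.
            return {}
        partitions[column] = unquote(value)
    return partitions


def prune(table_path, files, column, wanted):
    """Keep only the files whose inferred partition value equals `wanted`."""
    return [
        f for f in files
        if parse_hive_partitions(table_path, f).get(column) == wanted
    ]
```

With a layout such as `/data/tbl/dt=2021-03-01/region=us/f.parquet`, the sketch recovers `dt` and `region`. A layout such as `/data/tbl/2021/03/01/f.parquet` yields no partition columns, because the directory names carry no column names.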
[GitHub] [hudi] umehrot2 commented on pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
umehrot2 commented on pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#issuecomment-807830171

Can you run the unit tests you added once with `-Pspark3` to make sure this works seamlessly with Spark 3? The Travis tests currently don't run with Spark 3.