[GitHub] [hudi] umehrot2 commented on pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-30 Thread GitBox


umehrot2 commented on pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#issuecomment-810659252


   @pengzhiwei2018 Please also fix the commit message. It does not need to 
carry the whole development history, for example:
   ```
   add test case
   
   remove used imports
   ...
   ```
   Also, I think the commit message should be: `Implement Spark's FileIndex for 
Hudi to support queries via Hudi DataSource using non-globbed table path and 
partition pruning`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] umehrot2 commented on pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-26 Thread GitBox


umehrot2 commented on pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#issuecomment-808606053


   @pengzhiwei2018 I tested Hudi without this patch via Spark SQL and I am a 
little confused. With Spark SQL, partition pruning already works seamlessly 
for Hudi. Just start Spark SQL with:
   ```
   spark-sql \
     --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
     --conf "spark.hadoop.mapreduce.input.pathFilter.class=org.apache.hudi.hadoop.HoodieROTablePathFilter" \
     --jars /usr/lib/hudi/hudi-spark-bundle.jar
   ```
   Spark is able to get the partition schema from the catalog using 
`CatalogFileIndex` and do the partition pruning. So is the partition pruning 
support we are adding here meant for datasource-based queries? I think that 
for hive-style partitioned tables, pruning should already have worked via the 
Spark datasource too, because Spark tries to infer partition columns from the 
path, but I am not sure why it does not.
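
To make the point about hive-style paths concrete, here is a minimal sketch of inferring partition columns from `key=value` directory names and using them to skip files. It is only an illustration, not Spark's or Hudi's actual implementation: the function names `parse_hive_partitions` and `prune`, the sample paths, and the choice to keep all values as strings are assumptions made for the example.

```python
from urllib.parse import unquote


def parse_hive_partitions(table_root: str, file_path: str) -> dict:
    """Return {column: value} for key=value directories between root and file.

    Hypothetical helper for illustration; real engines also infer value types.
    """
    root = table_root.rstrip("/")
    if not file_path.startswith(root + "/"):
        raise ValueError(f"{file_path} is not under {table_root}")
    relative = file_path[len(root) + 1:]
    segments = relative.split("/")[:-1]  # drop the file name itself
    partitions = {}
    for segment in segments:
        if "=" not in segment:
            # Not hive-style: no column name can be inferred from this segment.
            continue
        name, _, value = segment.partition("=")
        partitions[name] = unquote(value)  # directory names may be URL-escaped
    return partitions


def prune(paths, table_root: str, column: str, wanted: str) -> list:
    """Keep only files whose inferred partition value matches the filter."""
    return [
        p for p in paths
        if parse_hive_partitions(table_root, p).get(column) == wanted
    ]


if __name__ == "__main__":
    files = [
        "/data/tbl/year=2021/month=03/a.parquet",
        "/data/tbl/year=2020/month=12/b.parquet",
    ]
    print(prune(files, "/data/tbl", "year", "2021"))
```

With a non-hive-style layout such as `/data/tbl/2021/03/a.parquet`, the sketch infers no columns at all, so a path-based approach has nothing to prune on and the partition schema has to come from somewhere else.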






[GitHub] [hudi] umehrot2 commented on pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-25 Thread GitBox


umehrot2 commented on pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#issuecomment-807830171


   Can you run the unit tests you added once with `-Pspark3` to make sure 
they pass on Spark 3? The Travis tests currently do not run with Spark 3.

