Hi team, I was wondering if it's possible to leverage Spark's built-in optimisations for COPY_ON_WRITE tables with PySpark?
The documentation at https://hudi.apache.org/docs/querying_data.html describes how to do this for Scala/Java:

"If using spark’s built in support, additionally a path filter needs to be pushed into sparkContext as follows. This method retains Spark built-in optimizations for reading parquet files like vectorized reading on Hudi Hive tables."

    spark.sparkContext.hadoopConfiguration.setClass(
      "mapreduce.input.pathFilter.class",
      classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter],
      classOf[org.apache.hadoop.fs.PathFilter]);
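Is there an equivalent way to do this from PySpark? Below is a rough sketch of what I imagine it would look like. It assumes the Py4J-exposed JavaSparkContext (sparkContext._jsc) and that passing the filter's class name as a plain string to Hadoop's Configuration.set() has the same effect as the Scala setClass() call (my understanding is that setClass() just stores the class name string internally), but I haven't verified this:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Push the Hudi path filter into the Hadoop configuration so that
    # Spark's built-in parquet reader only sees the latest commit files.
    # Sketch only: assumes Configuration.set() with the class name
    # string is equivalent to the Scala setClass() call above.
    spark.sparkContext._jsc.hadoopConfiguration().set(
        "mapreduce.input.pathFilter.class",
        "org.apache.hudi.hadoop.HoodieROTablePathFilter",
    )

If the string-based set() is not equivalent, a pointer to the correct Py4J incantation would be much appreciated.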
Regards, Karl