Hi team,

I was wondering if it's possible to leverage Spark's built-in optimisations
for COPY_ON_WRITE tables with PySpark?

The documentation here: https://hudi.apache.org/docs/querying_data.html

describes how to do this for Scala/Java:

"If using spark’s built in support, additionally a path filter needs to be
pushed into sparkContext as follows. This method retains Spark built-in
optimizations for reading parquet files like vectorized reading on Hudi
Hive tables.

spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter],
classOf[org.apache.hadoop.fs.PathFilter]);
"

Regards,
Karl
