Hi All,

I use the following to read a set of parquet files when they are scattered across many partitions:
paths = ['p1', 'p2', ... 'p10000']
df = spark.read.parquet(*paths)

The method above feels like it is reading those files sequentially rather than parallelizing the read operation - is that correct? If I put all of these files under a single path and read it as below, it works faster:

path = 'consolidated_path'
df = spark.read.parquet(path)

Is my observation correct? If so, is there a way to optimize reads from multiple/specific paths?

--
Regards,
Rishi Shah
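
P.S. In case it helps to reproduce, here is a minimal sketch of the comparison I mean. The paths and the partition layout are placeholders (not my real job), and the timing is just wall-clock around a count() to force the read:

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-read-comparison").getOrCreate()

# Placeholder per-partition paths standing in for the real p1..p10000
paths = [f"/data/events/part={i}" for i in range(10)]

start = time.time()
df_many = spark.read.parquet(*paths)   # explicit list of paths
df_many.count()                        # force the files to actually be read
print("multi-path read took", time.time() - start, "s")

start = time.time()
df_one = spark.read.parquet("/data/events")  # single consolidated parent path
df_one.count()
print("consolidated read took", time.time() - start, "s")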