Hello there,

I am trying to understand how I can increase the parallelism of the tasks
that run for a particular Spark job.
Here is my observation...

scala> spark.read.parquet("hdfs://somefile").toJavaRDD.partitions.size()
25

> hadoop fs -ls hdfs://somefile | grep 'part-r' | wc -l
200

> hadoop fs -du -h -s hdfs://somefile
2.2 G
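
For reference, the same count can also be read off the underlying RDD
(this should report the same 25 as above; the path is just a placeholder):

scala> spark.read.parquet("hdfs://somefile").rdd.getNumPartitions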

I notice that the number of part files written to HDFS during the save
operation depends on the repartition / coalesce applied before the write.
In other words, the number of part files can be tweaked with that parameter.
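
For example, a minimal sketch of what I mean at save time (both paths here
are just placeholders):

scala> val df = spark.read.parquet("hdfs://somefile").repartition(200)
scala> df.write.parquet("hdfs://someotherpath")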

But how do I control 'partitions.size()' on the read side? I would like it
to be 200 (without having to repartition during the read) so that more
tasks run for this job. This has a major impact on the time it takes to
perform query operations on this data.
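
To be clear, what I am trying to avoid is an explicit repartition on every
read just to get more tasks, i.e. something like (same placeholder path):

scala> val input = spark.read.parquet("hdfs://somefile").repartition(200)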

On a side note, I do understand that 200 parquet part files for the above
2.2 G seems like overkill for a 128 MB block size. Ideally it should be
about 18 parts.
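
That estimate is just the total size divided by the block size: 2.2 GB is
roughly 2252.8 MB, and 2252.8 / 128 is about 17.6, which rounds up to 18.
A quick sanity check in the REPL:

scala> math.ceil(2.2 * 1024 / 128)
18.0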

Please advise,
Muthu
