Hello there, I am trying to understand how I could increase the parallelism of the tasks that run for a particular Spark job. Here is my observation...
scala> spark.read.parquet("hdfs://somefile").toJavaRDD.partitions.size()
25

> hadoop fs -ls hdfs://somefile | grep 'part-r' | wc -l
200

> hadoop fs -du -h -s hdfs://somefile
2.2 G

I notice that the number of part files written to HDFS during the save operation follows whatever repartition / coalesce value I use; in other words, the part-file count can be tweaked through that parameter. But how do I control the 'partitions.size()' on the read side? I would like it to be 200, without having to explicitly repartition during the read (as sketched in the P.S. below), so that more tasks run for this job. This has a major impact in terms of the time it takes to perform query operations on this job.

On a side note, I do understand that 200 parquet part files for the above 2.2 G seems like overkill for a 128 MB block size; ideally it should be about 18 parts (2.2 GB / 128 MB ≈ 18).

Please advise,
Muthu
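P.S. For context, here is a rough sketch (run in spark-shell, so the `spark` session is already defined; the paths are just placeholders) of the two things I mentioned above: the write side, where coalesce controls the number of part files, and the read side, where the only way I currently know to get 200 partitions is an explicit repartition, which is the extra shuffle I would like to avoid:

  // Write side: repartition/coalesce decides how many part files land on HDFS.
  val df = spark.read.parquet("hdfs://somefile")
  df.coalesce(18).write.parquet("hdfs://somefile_coalesced")   // -> roughly 18 part files

  // Read side: the explicit repartition I would like to avoid,
  // since it adds a shuffle just to get more tasks downstream.
  val repartitioned = spark.read.parquet("hdfs://somefile").repartition(200)
  repartitioned.rdd.getNumPartitions   // 200, but only after the extra shuffle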