Spark: 2.2
Number of cores: 128 (all allocated to Spark)
Filesystem: Alluxio 1.6
Block size on Alluxio: 32MB
Input1 size: 586MB (150M records, a single int column), spread across 20 parquet files of 29MB each (1 Alluxio block per file).
Input2 size: 50MB (10M records, a single int column), also spread across 20 parquet files of 2.2MB each (1 Alluxio block per file).

Operation: read the parquet files as a DataFrame.

For Input1, 120 tasks are created; for Input2, 20 tasks are created. How is the number of tasks calculated in each case?

Secondly, in the task details UI I see that some tasks report an "Input Size" of only a few bytes while for others it is in MB. Further investigation shows that exactly 20 tasks have an input size of around 29MB, while the remaining 100 read only a few bytes each.

We use parquet-cpp to generate the parquet files and then read them in Spark. We want to understand why around 120 tasks are generated (we expected 20), as it is hurting our core utilization.

Thanks and regards,
Sanjeev
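For reference, the task counts above are consistent with how Spark 2.2 sizes splits for file-based sources. Below is a minimal sketch of that sizing logic in Python, assuming the default config values `spark.sql.files.maxPartitionBytes` = 128MB and `spark.sql.files.openCostInBytes` = 4MB, and taking `defaultParallelism` as the 128 cores in this setup. (In Spark itself this lives in the Scala planner; the function name `num_tasks` here is purely illustrative, and the real planner bin-packs splits into partitions, which happens to give the same counts for these inputs.)

```python
# Sketch of Spark 2.2's file-split sizing, under the assumptions above:
#   maxSplitBytes = min(maxPartitionBytes, max(openCostInBytes, bytesPerCore))
# where bytesPerCore charges each file an extra openCostInBytes.
import math

MB = 1024 * 1024

def num_tasks(file_sizes, max_partition_bytes=128 * MB,
              open_cost=4 * MB, default_parallelism=128):
    # total bytes includes an open-cost penalty per file
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total / default_parallelism
    max_split = min(max_partition_bytes, max(open_cost, bytes_per_core))
    # each file is cut into ceil(size / max_split) splits
    return sum(math.ceil(size / max_split) for size in file_sizes)

# Input1: 20 parquet files of ~29MB each
# bytesPerCore = 20 * 33MB / 128 ~ 5.16MB, so each 29MB file yields 6 splits
print(num_tasks([29 * MB] * 20))        # -> 120

# Input2: 20 parquet files of ~2.2MB each
# bytesPerCore < openCost, so maxSplit = 4MB and each file is one split
print(num_tasks([int(2.2 * MB)] * 20))  # -> 20
```

This would also explain the skewed "Input Size" column: Parquet row groups cannot be split, so for each Input1 file only the split that contains the row group actually reads the ~29MB, while the other splits of that file read just footer metadata.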