I was trying something basic to better understand tasks, stages, and shuffles in Spark. The dataset is 256 MB.
I tried this in Zeppelin:

```scala
val tmpDF = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .csv("s3://l4b-d4t4/wikipedia/pageviews-by-second-tsv")

tmpDF.count
```

This kicked off 4 jobs:

- 3 jobs for the first statement, and
- 1 job with 2 stages for `tmpDF.count`.

The last stage of the job corresponding to the count statement shows some puzzling data that I'm unable to explain:

1. The Stage Details section says "Input Size / Records: 186.6 MB / 7200000", while Aggregated Metrics by Executor says "Input Size / Records" is "186.6 MB / 5371292" - Stage Details UI <https://i.stack.imgur.com/KeeiY.png>
2. In the task list, one particular server, ip-x-x-x-60.eu-west-1.compute.internal, has 4 tasks with "0.0 B / 457130" as the value for "Input Size / Records" - Task Details UI <https://i.stack.imgur.com/D6ogR.jpg>

I initially thought this was some local disk cache, or something to do with EMRFS. However, once I cached the DataFrame and took the count again, it showed "16.8 MB / 46" for all 16 tasks corresponding to the 16 partitions.

Any links/pointers to understand this better would be highly helpful.
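For reference, the caching experiment mentioned above can be sketched roughly as follows (a minimal reconstruction assuming the same `tmpDF`; the exact calls are my assumption, not the original notebook code):

```scala
// Mark the DataFrame for caching; for Datasets/DataFrames the default
// storage level is MEMORY_AND_DISK.
tmpDF.cache()

// The first action materializes the cache by reading from S3.
tmpDF.count

// Subsequent actions read the cached partitions, so the Spark UI then
// reports in-memory block sizes per task instead of S3 input bytes.
tmpDF.count

// Number of partitions the count stage is split across (16 in this case).
println(tmpDF.rdd.getNumPartitions)
```

After the cache is populated, the "Input Size / Records" column in the task list reflects the cached blocks rather than the raw CSV input, which is consistent with seeing uniform values across all 16 tasks.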