I was trying something basic to understand tasks, stages, and shuffles a bit
better in Spark. The dataset is 256 MB.

I tried this in Zeppelin:

val tmpDF = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", "true")

This kicked off 4 jobs:

   - 3 jobs for the first statement and
   - 1 job with 2 stages for tmpDF.count
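
For anyone who wants to reproduce the job and stage breakdown, here is a minimal sketch of the whole sequence. The real input path is not shown in this post, so `samplePath` and its three-row CSV are hypothetical stand-ins, and the local master is only an assumption for running it outside a cluster. With `inferSchema` enabled, Spark has to read the data while the DataFrame is being defined, which is why jobs appear before any action is called. The `explain()` call prints the plan of the equivalent global aggregation, which should show a partial count, an exchange to a single partition, and a final count, i.e. the two stages of the count job.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one already exists; otherwise starts a local one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("count-stages-sketch")
  .getOrCreate()

// Hypothetical stand-in for the real 256 MB file: a tiny CSV in a temp location.
val samplePath = Files.createTempFile("sample", ".csv")
Files.write(samplePath, "id,name\n1,a\n2,b\n3,c\n".getBytes(StandardCharsets.UTF_8))

// Same reader options as in the post; header and schema inference run jobs here.
val tmpDF = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .csv(samplePath.toUri.toString)

// Number of input partitions, which is the task count of the scan stage.
val numPartitions = tmpDF.rdd.getNumPartitions

// The action: one job, with a per-partition count followed by a final merge.
val rowCount = tmpDF.count()

// Physical plan of the equivalent global aggregation.
tmpDF.groupBy().count().explain()
```

Comparing `numPartitions` with the task count of the first count stage in the UI is a quick way to check which stage is which.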

The last stage of the job that corresponded to the count statement has some
puzzling data that I'm unable to explain.


   - The stage details section says "Input Size / Records: 186.6 MB / 7200000",
     while Aggregated Metrics by Executor shows "Input Size / Records" as
     "186.6 MB / 5371292" (Stage Details UI).

   - In the tasks list, one particular server,
     ip-x-x-x-60.eu-west-1.compute.internal, has 4 tasks with "0.0 B / 457130"
     as the value for "Input Size / Records" (Task Details UI).

I initially thought this was some local disk cache or something to do with
EMRFS. However, once I cached the DataFrame and took the count again, it
showed "16.8 MB / 46" for all 16 tasks corresponding to the 16
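
One possible reading of the "46", offered as a guess rather than a confirmed explanation: for a cached DataFrame the "records" a task reads may be columnar cache batches rather than rows. The sketch below assumes the default `spark.sql.inMemoryColumnarStorage.batchSize` of 10000, which is not something stated in this post, and uses the 457130-row task size quoted above.

```scala
// Row count quoted in the task list above.
val rowsPerTask = 457130L

// Assumption: default value of spark.sql.inMemoryColumnarStorage.batchSize.
val assumedBatchSize = 10000L

// Ceiling division: how many cache batches are needed to hold that many rows.
val batchesPerTask = (rowsPerTask + assumedBatchSize - 1) / assumedBatchSize

println(s"batches per task: $batchesPerTask")
```

If the batch size on the cluster was changed from the default, the number would differ, so this only shows that the figures are consistent with that reading.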

Any links or pointers to understand this a bit better would be very helpful.
