Was trying something basic to understand tasks, stages, and shuffles a bit
better in Spark. The dataset is 256 MB.

Tried this in Zeppelin:

val tmpDF = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .csv("s3://l4b-d4t4/wikipedia/pageviews-by-second-tsv")
tmpDF.count

This kicked off 4 jobs:

   - 3 jobs for the first statement (see the sketch below), and
   - 1 job with 2 stages for tmpDF.count
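
The extra jobs on the read statement presumably come from reading the header
plus the pass that inferSchema makes over the file. One way to check that (a
rough sketch only -- the column names/types below are placeholders, not the
dataset's real header) is to supply an explicit schema and see whether the
read collapses to fewer jobs:

import org.apache.spark.sql.types._

// Placeholder schema -- replace with the dataset's actual columns
val pageviewsSchema = StructType(Seq(
  StructField("timestamp", StringType),
  StructField("site", StringType),
  StructField("requests", IntegerType)
))

val explicitDF = spark.read
  .option("header", "true")
  .option("delimiter", ",")
  .schema(pageviewsSchema)  // no inferSchema, so no extra scan of the data
  .csv("s3://l4b-d4t4/wikipedia/pageviews-by-second-tsv")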

The last stage of the job corresponding to the count statement has some
puzzling data that I'm unable to explain (a cross-check sketch follows the
two points below).

   1. The stage details section says "Input Size / Records: 186.6 MB / 7200000",
      while Aggregated Metrics by Executor says "Input Size / Records" is
      "186.6 MB / 5371292" - Stage Details UI
      <https://i.stack.imgur.com/KeeiY.png>

   2. In the tasks list, one particular server,
      ip-x-x-x-60.eu-west-1.compute.internal, has 4 tasks with "0.0 B / 457130"
      as the value for "Input Size / Records" - Task Details UI
      <https://i.stack.imgur.com/D6ogR.jpg>
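
To cross-check what the UI is reporting, one option (just a sketch, not
something from the original run) is to attach a SparkListener and print the
per-task input metrics while the count runs:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Log bytes/records read per task so the numbers can be compared with the UI
spark.sparkContext.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
        s"host=${taskEnd.taskInfo.host} " +
        s"bytesRead=${metrics.inputMetrics.bytesRead} " +
        s"recordsRead=${metrics.inputMetrics.recordsRead}")
    }
  }
})

tmpDF.count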

I initially thought this was some local disk cache or something to do with
EMRFS. However, once I cached the DataFrame and took the count again, it
showed "16.8 MB / 46" for all 16 tasks corresponding to the 16 partitions.
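
For reference, the caching step was roughly the following (a sketch; the
actual Zeppelin cells may have looked slightly different):

// Cache the DataFrame; the cache is lazy, so the first count materialises it
tmpDF.cache()
tmpDF.count  // populates the cache
tmpDF.count  // now reads the 16 in-memory partitions ("16.8 MB / 46" per task)

// Partition count, for reference
tmpDF.rdd.getNumPartitions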

Any links/pointers to understand this a bit better would be very helpful.
