Hi, I created a simple synthetic test which does a sample calculation twice, each time with different partitioning: def time[R](block: => R): Long = { val t0 = System.currentTimeMillis() block // call-by-name val t1 = System.currentTimeMillis() t1 - t0 }
val base_df = spark.range(10000001).withColumn("A", $"id" % 65) val df1 = base_df.cache() val df2 = base_df.repartition(201).cache() df1.count() df2.count() val t1 = time { df1.groupBy("A").agg(sum($"id")).collect() } val t2 = time { df2.groupBy("A").agg(sum($"id")).collect() } println(s"first took $t1, second took $t2, ratio=${t2/t1}") Now I ran this on three different platforms (all using local[2] master): 1. In windows on a laptop I get: first took 2454, second took 48089, ratio=19 2. In a VM on the same laptop (under centos) I get: first took 2911, second took 8699, ratio=2 3. On another VM with a lot more power (but still using local[2] as master) I get: first took 1483, second took 2354, ratio=1 I can't figure out these numbers. I expected to see the behavior on the strong VM where the difference was relatively small but why would it be such a larger ratio on the local VM and a ratio of 19 On windows Can anyone help me figure out why this is? Assaf.