Hi,

I created a simple synthetic test which does a sample calculation twice, each 
time with different partitioning:
def time[R](block: => R): Long = {
  val t0 = System.currentTimeMillis()
  block    // call-by-name
  val t1 = System.currentTimeMillis()
  t1 - t0
}

val base_df = spark.range(10000001).withColumn("A", $"id" % 65)
val df1 = base_df.cache()
val df2 = base_df.repartition(201).cache()
df1.count()
df2.count()
val t1 = time { df1.groupBy("A").agg(sum($"id")).collect() }
val t2 = time { df2.groupBy("A").agg(sum($"id")).collect() }
println(s"first took $t1, second took $t2, ratio=${t2/t1}")

Now I ran this on three different platforms (all using local[2] master):

1.       In windows on a laptop I get: first took 2454, second took 48089, 
ratio=19

2.       In a VM on the same laptop (under centos) I get: first took 2911, 
second took 8699, ratio=2

3.       On another VM with a lot more power (but still using local[2] as 
master) I get: first took 1483, second took 2354, ratio=1
I can't figure out these numbers. I expected to see the behavior on the strong 
VM where the difference was relatively small but why would it be such a larger 
ratio on the local VM and a ratio of 19 On windows

Can anyone help me figure out why this is?
Assaf.

Reply via email to