Hello all,
I have a really strange thing going on.
I have a test dataset of 500K lines in a gzipped CSV file.
I have an array of "column processors," one for each column in the dataset.
A Processor tracks aggregate state and has a method "process(v: String)".
I'm calling:
val processors: Array[Processor] = ....
sc.textFile(gzippedFileName).aggregate(processors)(
  (curState, row) => {
    row.split(",", -1).zipWithIndex.foreach {
      case (value, idx) => curState(idx).process(value)
    }
    curState
  }, ....)
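For reference, Spark's aggregate is curried — aggregate(zeroValue)(seqOp, combOp) — and the elided combOp must merge the per-partition processor arrays. Here is a minimal sketch of the whole call; the Processor body, its merge method, and the names numColumns, sc, and gzippedFileName are all assumptions, not part of the original code:

    // Hypothetical Processor: tracks per-column aggregate state.
    // The merge method is an assumption; the original elides combOp.
    class Processor extends Serializable {
      private var count = 0L                 // example aggregate state: a row count
      def process(v: String): Unit = { count += 1 }
      def merge(other: Processor): Processor = { count += other.count; this }
    }

    val processors: Array[Processor] = Array.fill(numColumns)(new Processor)

    val result = sc.textFile(gzippedFileName).aggregate(processors)(
      // seqOp: feed each field of the row to its column's processor
      (curState, row) => {
        row.split(",", -1).zipWithIndex.foreach {
          case (value, idx) => curState(idx).process(value)
        }
        curState
      },
      // combOp: merge per-partition processor arrays pairwise
      (a, b) => a.zip(b).map { case (x, y) => x.merge(y) }
    )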
If the class definition for Processor is in the same file as the driver,
this runs in ~23 seconds. If I move the classes to a separate file in the
same package, without ANY OTHER CHANGES, it takes ~35 seconds.
This doesn't make any sense to me. I can't see how the compiled class
files could even differ between the two cases.
Does anyone have an explanation for why this might be?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Wildly-varying-aggregate-performance-depending-on-code-location-tp18752.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.