Hello all,
I have a really strange thing going on.
I have a test dataset of 500K lines in a gzipped CSV file.
I have an array of "column processors," one for each column in the dataset.
A Processor tracks aggregate state and has a method "process(v: String)".
I'm calling:
val processors: Array[Processor] = ....
sc.textFile(gzippedFileName).aggregate(processors)(
  (curState, row) => {
    row.split(",", -1).zipWithIndex.foreach {
      case (value, idx) => curState(idx).process(value)
    }
    curState
  }, ....)
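For reference, Spark's aggregate is curried — aggregate(zeroValue)(seqOp, combOp) — and the elided combOp must merge the per-partition processor arrays. Here is a minimal sketch of the whole call; the Processor body, its merge method, and the names numColumns, sc, and gzippedFileName are all assumptions, not part of the original code:

    // Hypothetical Processor: tracks per-column aggregate state.
    // The merge method is an assumption; the original elides combOp.
    class Processor extends Serializable {
      private var count = 0L                 // example aggregate state: a row count
      def process(v: String): Unit = { count += 1 }
      def merge(other: Processor): Processor = { count += other.count; this }
    }

    val processors: Array[Processor] = Array.fill(numColumns)(new Processor)

    val result = sc.textFile(gzippedFileName).aggregate(processors)(
      // seqOp: feed each field of the row to its column's processor
      (curState, row) => {
        row.split(",", -1).zipWithIndex.foreach {
          case (value, idx) => curState(idx).process(value)
        }
        curState
      },
      // combOp: merge per-partition processor arrays pairwise
      (a, b) => a.zip(b).map { case (x, y) => x.merge(y) }
    )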
If the class definition for Processor is in the same file as the driver,
this runs in ~23 seconds. If I move the classes to a separate file in the
same package, without ANY OTHER CHANGES, it takes ~35 seconds.
This doesn't make any sense to me. I can't see how the compiled class
files could even differ between the two cases.
Does anyone have an explanation for why this might be?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Wildly-varying-aggregate-performance-depending-on-code-location-tp18752.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.