Hi, my team is setting up a machine-learning framework based on Spark's MLlib that currently uses logistic regression. I enabled Kryo serialization and enforced class registration, so I know that all serialized classes are registered. However, running times are consistently longer when Kryo serialization is enabled. This holds both when running locally on a smaller sample (1.6 minutes with Kryo vs. 1.3 minutes without) and when running on a larger sample on AWS with two worker nodes (2h30 vs. 1h50).
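For context, here is a minimal sketch of the kind of Kryo setup described above. It assumes the standard SparkConf settings (spark.serializer, spark.kryo.registrationRequired) and the registerKryoClasses helper. The object name, the app name, and the list of registered classes are illustrative placeholders, not the actual configuration of this application.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector}
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical driver object; the name is an assumption for illustration.
object KryoSetupSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kryo-setup-sketch") // placeholder app name
      // Switch from the default Java serializer to Kryo.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Fail fast if any class is serialized without being registered.
      .set("spark.kryo.registrationRequired", "true")
      // Illustrative list only; a real job registers whatever classes
      // the registrationRequired errors report.
      .registerKryoClasses(Array(
        classOf[LabeledPoint],
        classOf[DenseVector],
        classOf[SparseVector],
        classOf[Array[Double]],
        classOf[Array[Int]]
      ))

    // The master URL is expected to come from spark-submit.
    val sc = new SparkContext(conf)
    try {
      // ... load data and train the logistic regression model here ...
    } finally {
      sc.stop()
    }
  }
}
```

Removing the two `.set(...)` lines and the `registerKryoClasses` call falls back to the default Java serializer, which is the baseline the timings above are compared against.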
The monitoring tools suggest that Task Deserialization Times are similar (although perhaps slightly longer for Kryo), but Task Durations and even Scheduler Delays increase significantly. There is also a significant difference in memory usage: with Kryo the number of stored RDDs is higher (much more so on the local sample: 40 vs. 4). Does anyone have an idea of what could be going on, or where I should focus to find out?