Hi all, I was varying the storage level of the cached RDD in a KMeans program implemented with the MLlib library and got some confusing but interesting results. The base application code is from a benchmark suite named SparkBench <https://github.com/CODAIT/spark-bench>. I changed the storage level of the data RDD passed to the KMeans train function, and MEMORY_AND_DISK_SER performs considerably worse than DISK_ONLY. MEMORY_AND_DISK performed best, as expected, but why a memory-serialized storage level would perform worse than a disk-serialized one is very confusing. I am running 1 master node and 4 slave nodes, with each executor given a 48g JVM heap, so the cached data should fit comfortably in memory.
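For reference, this is roughly the setup I mean, as a minimal sketch (the input path, feature parsing, k, and iteration count here are placeholders, not the actual SparkBench configuration); only the persist() line changes between runs:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel

// sc is the SparkContext; path and parsing are illustrative
val data = sc.textFile("hdfs:///path/to/kmeans_input")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

// The storage level being varied between runs:
// StorageLevel.MEMORY_AND_DISK, MEMORY_AND_DISK_SER, or DISK_ONLY
data.persist(StorageLevel.MEMORY_AND_DISK_SER)

// Train with MLlib KMeans; k and maxIterations are placeholder values
val model = KMeans.train(data, 10, 20)
```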
If anyone has any idea or suggestion as to why this happens, please let me know.

Regards,
Muhib