Hi Spark users,

I am running a PageRank-style algorithm on Bagel and running into out-of-memory issues.
Referring to the table below: rdd_120 is the RDD of vertices, serialized and compressed in memory. On each iteration, Bagel deserializes this compressed RDD; e.g. rdd_126 is the uncompressed version of rdd_120, persisted in memory and on disk. As iterations pile up, the cached partitions start getting evicted. The moment a rdd_120 partition gets evicted, it must be recomputed, and performance degrades sharply. Although we don't need the uncompressed RDDs from previous iterations, they are the last ones to be evicted thanks to the LRU policy.

Should I make Bagel use DISK_ONLY persistence? How much of a performance hit would that be? Or maybe there is a better solution here.

Storage (from the web UI at http://ec2-54-234-176-171.compute-1.amazonaws.com:4040/storage):

RDD Name   Storage Level                            Cached Partitions   Fraction Cached   Size in Memory   Size on Disk
rdd_83     Memory Serialized 1x Replicated          23                  12%               83.7 MB          0.0 B
rdd_95     Memory Serialized 1x Replicated          23                  12%               2.5 MB           0.0 B
rdd_120    Memory Serialized 1x Replicated          25                  13%               761.1 MB         0.0 B
rdd_126    Disk Memory Deserialized 1x Replicated   192                 100%              77.9 GB          1016.5 MB
rdd_134    Disk Memory Deserialized 1x Replicated   185                 96%               60.8 GB          475.4 MB

Thanks and regards,
~Mayuresh
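P.S. For concreteness, here is a rough sketch of what I mean by forcing a storage level, assuming a Spark version whose Bagel.run overloads accept a StorageLevel argument (the exact package path and signature vary by release, and sc, verts, msgs, numParts, and compute stand in for my actual SparkContext, vertex RDD, message RDD, partition count, and compute function):

```scala
import org.apache.spark.bagel.Bagel
import org.apache.spark.storage.StorageLevel

// DISK_ONLY would keep the per-iteration vertex RDDs entirely on disk,
// so the huge deserialized copies (rdd_126, rdd_134) never compete with
// the compressed vertex RDD (rdd_120) for memory.
// MEMORY_AND_DISK_SER is a possible middle ground: blocks stay serialized
// in memory (much smaller footprint) and spill to disk instead of being
// dropped and recomputed.
val result = Bagel.run(sc, verts, msgs, numParts,
  StorageLevel.DISK_ONLY)(compute)
```

I am not sure how much of the iteration time would then go into disk I/O versus the recomputation we see now, which is really the crux of my question.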
