Hi Spark users,

I am running a PageRank-style algorithm on Bagel and running into "out of
memory" errors.

Referring to the table below, rdd_120 is the RDD of vertices, serialized
and compressed in memory. On each iteration, Bagel deserializes the
compressed RDD; e.g. rdd_126 is the uncompressed version of rdd_120,
persisted in memory and on disk. As iterations pile up, the cached
partitions start getting evicted. The moment an rdd_120 partition is
evicted, it has to be recomputed and performance goes for a toss. Although
we don't need the uncompressed RDDs from previous iterations, they are the
last ones to be evicted thanks to the LRU policy.
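One workaround I've been toying with is to drop the previous superstep's
uncompressed RDD explicitly instead of waiting for LRU eviction. A rough
sketch of what I mean (the `superstep` helper is hypothetical and stock
Bagel doesn't expose this hook, so this would need a small patch to Bagel's
run loop):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Sketch only: `superstep` stands in for one Bagel iteration over the
// vertex RDD. After each iteration we materialize the new RDD, then
// unpersist the previous one so LRU never has to choose between it and
// the serialized vertex RDD (rdd_120 in the table below).
def iterate(initial: RDD[String], numIterations: Int): RDD[String] = {
  var prev: RDD[String] = null
  var cur = initial
  for (_ <- 1 to numIterations) {
    val next = superstep(cur).persist(StorageLevel.MEMORY_AND_DISK)
    next.count()                       // force materialization first
    if (prev != null) prev.unpersist() // drop last iteration's cached copy
    prev = cur
    cur = next
  }
  cur
}
```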

Should I make Bagel use DISK_ONLY persistence? How much of a performance
hit would that be? Or is there a better solution here?
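For the DISK_ONLY route, I'm assuming I'd use the Bagel.run overload that
takes a StorageLevel, something like the sketch below (`vertices`,
`messages`, `numPartitions` and `compute` being the usual PageRank inputs,
elided here):

```scala
import org.apache.spark.bagel.Bagel
import org.apache.spark.storage.StorageLevel

// Sketch: pass DISK_ONLY instead of Bagel's default storage level so the
// per-iteration RDDs never compete with rdd_120 for memory.
val result = Bagel.run(sc, vertices, messages, numPartitions,
                       StorageLevel.DISK_ONLY)(compute)
```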

Storage

 RDD Name   Storage Level                            Cached Partitions   Fraction Cached   Size in Memory   Size on Disk
 rdd_83     Memory Serialized 1x Replicated          23                  12%               83.7 MB          0.0 B
 rdd_95     Memory Serialized 1x Replicated          23                  12%               2.5 MB           0.0 B
 rdd_120    Memory Serialized 1x Replicated          25                  13%               761.1 MB         0.0 B
 rdd_126    Disk Memory Deserialized 1x Replicated   192                 100%              77.9 GB          1016.5 MB
 rdd_134    Disk Memory Deserialized 1x Replicated   185                 96%               60.8 GB          475.4 MB
Thanks and regards,
~Mayuresh
