Hi,
I'm working on a project in Spark and am trying to understand its
memory behavior. To isolate the problem, we came up with the snippet
of code below, which very roughly resembles what we're actually doing.
When we run it, our master node quickly uses up its memory, even though
all of our RDDs are very small. Can someone explain what's going on
here and how we can avoid it?

    a = sc.parallelize(xrange(100), 10)
    b = a
    for i in xrange(100000):
        a = a.map(lambda x: x + 1)
        if i % 300 == 0:
            # We do this to try and force some of our RDD to evaluate
            a.persist()
            a.foreachPartition(lambda _: None)
            b.unpersist()
            b = a
    a.collect()
    b.unpersist()

-Richard Hofer