I'm running into an issue when trying to broadcast large variables with
pyspark.

A ~1GB array seems to blow up well beyond the driver machine's memory when
it gets pickled.
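
For concreteness, the failing call is essentially this (a minimal sketch;
big_array is just a stand-in for my real data):

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "broadcast-test")

    # stand-in for my real data: ~1GB of float64 values
    big_array = np.zeros(125000000)

    # pickles the whole array on the driver before shipping it to the
    # executors; the driver's memory spikes far past the array's ~1GB footprint
    bc = sc.broadcast(big_array)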

I've tried to get around this by broadcasting smaller chunks of it one at a
time, but I'm still running out of memory, presumably because the
intermediate pickled versions aren't getting garbage collected.
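
The chunked version looks roughly like this (continuing from the snippet
above; chunk_size is just a number I picked):

    chunk_size = 10000000  # ~80MB of float64 per chunk

    broadcasts = []
    for start in range(0, len(big_array), chunk_size):
        # each call pickles only this slice, yet the driver's memory keeps
        # climbing as if the intermediate pickled bytes are never freed
        broadcasts.append(sc.broadcast(big_array[start:start + chunk_size]))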

Any ideas on how to get around this?  Is this some sort of py4j limitation?
 Is there any reason that the Spark driver would be keeping the pickled
version around?

Thanks in advance for any help,
Sandy
