I'm running into an issue when trying to broadcast large variables with PySpark.
A ~1GB array seems to be blowing up beyond the size of the driver machine's memory when it's pickled. I've tried to get around this by broadcasting smaller chunks of it one at a time, but I'm still running out of memory, ostensibly because the intermediate pickled versions aren't getting garbage collected (a simplified sketch of what I'm doing is below).

Any ideas on how to get around this? Is this some sort of py4j limitation? Is there any reason the Spark driver would be keeping the pickled version around?

Thanks in advance for any help,
Sandy
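
In case it helps, here's roughly what the chunked broadcast looks like. This is a simplified sketch, not my actual code; the array contents, chunk size, and names are placeholders:

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="chunked-broadcast")

    # Placeholder standing in for the real ~1GB array (125M float64s ~= 1GB).
    big_array = np.arange(125000000, dtype=np.float64)

    # Broadcast the array one slice at a time instead of all at once.
    chunk_size = 12500000  # ~100MB per chunk
    broadcast_chunks = []
    for start in range(0, len(big_array), chunk_size):
        chunk = big_array[start:start + chunk_size]
        broadcast_chunks.append(sc.broadcast(chunk))
        # Drop the local reference so the slice (and ideally its pickled
        # form) can be garbage collected before the next iteration.
        del chunk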
