I've opened an issue for this on JIRA: https://spark-project.atlassian.net/browse/SPARK-1065
To clarify, is the driver JVM running out of memory with an OutOfMemoryError? Or is the Python process exceeding some memory limit?

On Fri, Feb 7, 2014 at 12:16 AM, Sandy Ryza <[email protected]> wrote:
> I'm running into an issue when trying to broadcast large variables with
> pyspark.
>
> A ~1GB array seems to be blowing up beyond the size of the driver
> machine's memory when it's pickled.
>
> I've tried to get around this by broadcasting smaller chunks of it one at
> a time. But I'm still running out of memory, ostensibly because the
> intermediate pickled versions aren't getting garbage collected.
>
> Any ideas on how to get around this? Is this some sort of py4j
> limitation? Is there any reason that the Spark driver would be keeping the
> pickled version around?
>
> thanks in advance for any help,
> Sandy
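
For reference, a minimal sketch of the chunked-broadcast approach described above. The names (big_array, the chunk count, the toy job) are hypothetical stand-ins, not Sandy's actual code; the idea is to broadcast one slice at a time, then unpersist and drop the driver-side reference so the intermediate pickled copies can be garbage collected. Note that Broadcast.unpersist() may not exist in the PySpark version in use here, in which case only the `del` is available.

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="chunked-broadcast-sketch")

    # Hypothetical stand-in for the real ~1GB array.
    big_array = np.zeros(1_000_000)

    # Split into smaller pieces so no single pickled payload is huge.
    chunks = np.array_split(big_array, 16)

    partial_sums = []
    for chunk in chunks:
        b = sc.broadcast(chunk)                  # pickle and ship one slice
        rdd = sc.parallelize(range(4), 4)
        # Toy job that reads the broadcast slice on the executors.
        partial_sums.append(rdd.map(lambda i: float(b.value.sum())).first())
        b.unpersist()                            # release executor copies
        del b                                    # drop the driver-side reference
                                                 # so the pickled bytes can be GC'd

Whether this actually frees memory between iterations depends on the driver holding no other reference to each pickled chunk, which is exactly the behavior in question in this thread.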
