I've opened an issue for this on JIRA: https://spark-project.atlassian.net/browse/SPARK-1065
To clarify, is the driver JVM running out of memory with an OutOfMemoryError? Or is the Python process exceeding some memory limit?

On Fri, Feb 7, 2014 at 12:16 AM, Sandy Ryza <[email protected]> wrote:
> I'm running into an issue when trying to broadcast large variables with
> pyspark.
>
> A ~1GB array seems to be blowing up beyond the size of the driver
> machine's memory when it's pickled.
>
> I've tried to get around this by broadcasting smaller chunks of it one at
> a time. But I'm still running out of memory, ostensibly because the
> intermediate pickled versions aren't getting garbage collected.
>
> Any ideas on how to get around this? Is this some sort of py4j
> limitation? Is there any reason that the Spark driver would be keeping the
> pickled version around?
>
> thanks in advance for any help,
> Sandy
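
For reference, a minimal sketch of the chunked-broadcast approach described above. The names (big_array, the chunk count, the toy job) are hypothetical stand-ins, not Sandy's actual code; the idea is to broadcast one slice at a time, then unpersist and drop the driver-side reference so the intermediate pickled copies can be garbage collected. Note that Broadcast.unpersist() may not exist in the PySpark version in use here, in which case only the `del` is available.

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="chunked-broadcast-sketch")

    # Hypothetical stand-in for the real ~1GB array.
    big_array = np.zeros(1_000_000)

    # Split into smaller pieces so no single pickled payload is huge.
    chunks = np.array_split(big_array, 16)

    partial_sums = []
    for chunk in chunks:
        b = sc.broadcast(chunk)                  # pickle and ship one slice
        rdd = sc.parallelize(range(4), 4)
        # Toy job that reads the broadcast slice on the executors.
        partial_sums.append(rdd.map(lambda i: float(b.value.sum())).first())
        b.unpersist()                            # release executor copies
        del b                                    # drop the driver-side reference
                                                 # so the pickled bytes can be GC'd

Whether this actually frees memory between iterations depends on the driver holding no other reference to each pickled chunk, which is exactly the behavior in question in this thread.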
