Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

2019-11-01 Thread Holden Karau
On Thu, Oct 31, 2019 at 10:04 PM Nicolas Paris 
wrote:

> Have you deactivated the Spark UI?
> I have read several threads explaining that the UI can lead to OOM because
> it stores 1000 DAGs by default.
>
>
> On Sun, Oct 20, 2019 at 03:18:20AM -0700, Paul Wais wrote:
> > Dear List,
> >
> > I've observed some sort of memory leak when using pyspark to run ~100
> > jobs in local mode.  Each job is essentially a create RDD -> create DF
> > -> write DF sort of flow.  The RDD and DFs go out of scope after each
> > job completes, hence I call this issue a "memory leak."  Here's
> > pseudocode:
> >
> > ```
> > row_rdds = []
> > for i in range(100):
> >   row_rdd = spark.sparkContext.parallelize([{'a': j} for j in range(1000)])
> >   row_rdds.append(row_rdd)
> >
> > for row_rdd in row_rdds:
> >   df = spark.createDataFrame(row_rdd)
> >   df.persist()
> >   print(df.count())
> >   df.write.save(...) # Save parquet
> >   df.unpersist()
> >
> >   # Does not help:
> >   # del df
> >   # del row_rdd
> > ```

The connection between Python GC/del and JVM GC is perhaps a bit weaker
than we might like. There certainly could be a problem here, but it still
shouldn’t be getting to the OOM state.
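
For what it's worth, a minimal sketch of forcing a JVM-side collection
through the gateway on top of the Python-side del/gc already tried; the _jvm
handle is pyspark-internal and System.gc() is only a request, so this is a
way to narrow things down rather than a fix (it assumes a live session named
spark and the df / row_rdd names from the pseudocode):

```
import gc

# Drop the Python-side references (as already tried), then collect on both
# sides of the py4j bridge. If the java process still balloons after this,
# the leak is probably not Python references pinning JVM objects.
del df, row_rdd
gc.collect()                          # Python-side collection
spark.sparkContext._jvm.System.gc()   # ask the driver JVM to collect
```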

>
> >
> > In my real application:
> >  * rows are much larger, perhaps 1MB each
> >  * row_rdds are sized to fit available RAM
> >
> > I observe that after 100 or so iterations of the second loop (each of
> > which creates a "job" in the Spark WebUI), the following happens:
> >  * pyspark workers have fairly stable resident and virtual RAM usage
> >  * java process eventually approaches resident RAM cap (8GB standard)
> > but virtual RAM usage keeps ballooning.
> >

Can you share what flags the JVM is launched with? Also, which JVM(s) are
ballooning?
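
If it helps in digging those up, here is a rough sketch of printing the
driver JVM's launch arguments from inside pyspark via the py4j gateway;
ManagementFactory is standard JDK, but the _jvm handle is pyspark-internal,
so treat it as a debugging hack rather than a stable API:

```
# Show what the driver JVM was actually started with, and its heap ceiling.
jvm = spark.sparkContext._jvm
runtime = jvm.java.lang.management.ManagementFactory.getRuntimeMXBean()
for arg in runtime.getInputArguments():
    print(arg)
print("max heap bytes:", jvm.java.lang.Runtime.getRuntime().maxMemory())
```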

>
> > Eventually the machine runs out of RAM and the linux OOM killer kills
> > the java process, resulting in an "IndexError: pop from an empty
> > deque" error from py4j/java_gateway.py .
> >
> >
> > Does anybody have any ideas about what's going on?  Note that this is
> > local mode.  I have personally run standalone masters and submitted a
> > ton of jobs and never seen something like this over time.  Those were
> > very different jobs, but perhaps this issue is bespoke to local mode?
> >
> > Emphasis: I did try to del the pyspark objects and run python GC.
> > That didn't help at all.
> >
> > pyspark 2.4.4 on java 1.8 on ubuntu bionic (tensorflow docker image)
> >
> > 12-core i7 with 16GB of ram and 22GB swap file (swap is *on*).
> >
> > Cheers,
> > -Paul
> >
>
> --
> nicolas
>
--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

2019-10-31 Thread Nicolas Paris
Have you deactivated the Spark UI?
I have read several threads explaining that the UI can lead to OOM because
it stores 1000 DAGs by default.
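
For reference, a sketch of what that could look like when building the
session. The property names (spark.ui.enabled, spark.ui.retainedJobs,
spark.ui.retainedStages, spark.sql.ui.retainedExecutions) are from the Spark
configuration docs, but double-check them against your version:

```
from pyspark.sql import SparkSession

# Either drop the UI entirely, or keep it but retain much less history.
spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.ui.enabled", "false")
         # ...or keep the UI and shrink its retained state instead:
         # .config("spark.ui.retainedJobs", "50")
         # .config("spark.ui.retainedStages", "50")
         # .config("spark.sql.ui.retainedExecutions", "50")
         .getOrCreate())
```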


On Sun, Oct 20, 2019 at 03:18:20AM -0700, Paul Wais wrote:
> Dear List,
> 
> I've observed some sort of memory leak when using pyspark to run ~100
> jobs in local mode.  Each job is essentially a create RDD -> create DF
> -> write DF sort of flow.  The RDD and DFs go out of scope after each
> job completes, hence I call this issue a "memory leak."  Here's
> pseudocode:
> 
> ```
> row_rdds = []
> for i in range(100):
>   row_rdd = spark.sparkContext.parallelize([{'a': j} for j in range(1000)])
>   row_rdds.append(row_rdd)
> 
> for row_rdd in row_rdds:
>   df = spark.createDataFrame(row_rdd)
>   df.persist()
>   print(df.count())
>   df.write.save(...) # Save parquet
>   df.unpersist()
> 
>   # Does not help:
>   # del df
>   # del row_rdd
> ```
> 
> In my real application:
>  * rows are much larger, perhaps 1MB each
>  * row_rdds are sized to fit available RAM
> 
> I observe that after 100 or so iterations of the second loop (each of
> which creates a "job" in the Spark WebUI), the following happens:
>  * pyspark workers have fairly stable resident and virtual RAM usage
>  * java process eventually approaches resident RAM cap (8GB standard)
> but virtual RAM usage keeps ballooning.
> 
> Eventually the machine runs out of RAM and the linux OOM killer kills
> the java process, resulting in an "IndexError: pop from an empty
> deque" error from py4j/java_gateway.py .
> 
> 
> Does anybody have any ideas about what's going on?  Note that this is
> local mode.  I have personally run standalone masters and submitted a
> ton of jobs and never seen something like this over time.  Those were
> very different jobs, but perhaps this issue is bespoke to local mode?
> 
> Emphasis: I did try to del the pyspark objects and run python GC.
> That didn't help at all.
> 
> pyspark 2.4.4 on java 1.8 on ubuntu bionic (tensorflow docker image)
> 
> 12-core i7 with 16GB of ram and 22GB swap file (swap is *on*).
> 
> Cheers,
> -Paul
> 

-- 
nicolas




Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

2019-10-31 Thread Paul Wais
Well, dumb question:

Given the workflow outlined above, should Local Mode keep running?  Or
is the leak a known issue?  I just wanted to check because I can't
recall seeing this issue with a non-local master, though it's possible
there were task failures that hid the issue.

If this issue looks new, what's the easiest way to record memory dumps
or do profiling?  Can I put something in my spark-defaults.conf?
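
Something like this is what I had in mind: plain HotSpot flags with
placeholder paths. I gather that depending on how the driver JVM gets
launched, these may need to go through spark.driver.extraJavaOptions in
spark-defaults.conf or --driver-java-options rather than the session
builder.

```
from pyspark.sql import SparkSession

# Placeholder paths; the equivalent spark-defaults.conf line would be
#   spark.driver.extraJavaOptions  -XX:+HeapDumpOnOutOfMemoryError ...
java_opts = " ".join([
    "-XX:+HeapDumpOnOutOfMemoryError",          # dump heap on JVM OOM
    "-XX:HeapDumpPath=/tmp/driver-heap.hprof",
    "-XX:+PrintGCDetails",                      # JDK 8 style GC logging
    "-Xloggc:/tmp/driver-gc.log",
])

spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.driver.extraJavaOptions", java_opts)
         .getOrCreate())
```

Though since it's the Linux OOM killer doing the killing rather than the JVM
throwing OutOfMemoryError, the heap-dump-on-OOM flag may never fire, so
periodic jmap snapshots might be the more useful route.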

The code is open source and the run is reproducible, although the
specific test currently requires a rather large (but public) dataset.

On Sun, Oct 20, 2019 at 6:24 PM Jungtaek Lim
 wrote:
>
> Honestly, I'd recommend spending some time looking into the issue yourself:
> take memory dumps at regular intervals and compare the differences (at the
> very least, share these dump files with the community, redacted if
> necessary). Otherwise someone else has to try to reproduce the problem
> without a reproducer, and may well fail to reproduce it even after spending
> their time. Memory leak issues are not easy to reproduce unless objects
> leak unconditionally.
>
> - Jungtaek Lim (HeartSaVioR)
>
> On Sun, Oct 20, 2019 at 7:18 PM Paul Wais  wrote:
>>
>> Dear List,
>>
>> I've observed some sort of memory leak when using pyspark to run ~100
>> jobs in local mode.  Each job is essentially a create RDD -> create DF
>> -> write DF sort of flow.  The RDD and DFs go out of scope after each
>> job completes, hence I call this issue a "memory leak."  Here's
>> pseudocode:
>>
>> ```
>> row_rdds = []
>> for i in range(100):
>>   row_rdd = spark.sparkContext.parallelize([{'a': j} for j in range(1000)])
>>   row_rdds.append(row_rdd)
>>
>> for row_rdd in row_rdds:
>>   df = spark.createDataFrame(row_rdd)
>>   df.persist()
>>   print(df.count())
>>   df.write.save(...) # Save parquet
>>   df.unpersist()
>>
>>   # Does not help:
>>   # del df
>>   # del row_rdd
>> ```
>>
>> In my real application:
>>  * rows are much larger, perhaps 1MB each
>>  * row_rdds are sized to fit available RAM
>>
>> I observe that after 100 or so iterations of the second loop (each of
>> which creates a "job" in the Spark WebUI), the following happens:
>>  * pyspark workers have fairly stable resident and virtual RAM usage
>>  * java process eventually approaches resident RAM cap (8GB standard)
>> but virtual RAM usage keeps ballooning.
>>
>> Eventually the machine runs out of RAM and the linux OOM killer kills
>> the java process, resulting in an "IndexError: pop from an empty
>> deque" error from py4j/java_gateway.py .
>>
>>
>> Does anybody have any ideas about what's going on?  Note that this is
>> local mode.  I have personally run standalone masters and submitted a
>> ton of jobs and never seen something like this over time.  Those were
>> very different jobs, but perhaps this issue is bespoke to local mode?
>>
>> Emphasis: I did try to del the pyspark objects and run python GC.
>> That didn't help at all.
>>
>> pyspark 2.4.4 on java 1.8 on ubuntu bionic (tensorflow docker image)
>>
>> 12-core i7 with 16GB of ram and 22GB swap file (swap is *on*).
>>
>> Cheers,
>> -Paul
>>




Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

2019-10-20 Thread Jungtaek Lim
Honestly, I'd recommend spending some time looking into the issue yourself:
take memory dumps at regular intervals and compare the differences (at the
very least, share these dump files with the community, redacted if
necessary). Otherwise someone else has to try to reproduce the problem
without a reproducer, and may well fail to reproduce it even after spending
their time. Memory leak issues are not easy to reproduce unless objects
leak unconditionally.
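
Not prescriptive, but roughly the kind of loop I mean, assuming a JDK with
jmap on the PATH and that you have found the driver JVM's pid (e.g. with
jps). Note that -histo:live forces a full GC before each histogram:

```
import subprocess
import time

def sample_histograms(pid, samples=10, interval_s=60, out_prefix="/tmp/histo"):
    """Grab a jmap class histogram every interval_s seconds; diffing
    successive snapshots usually shows which object counts keep growing."""
    for n in range(samples):
        # e.g. sample_histograms(12345) -- the pid here is hypothetical
        out = subprocess.check_output(["jmap", "-histo:live", str(pid)])
        with open("%s-%03d.txt" % (out_prefix, n), "wb") as f:
            f.write(out)
        time.sleep(interval_s)
```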

- Jungtaek Lim (HeartSaVioR)

On Sun, Oct 20, 2019 at 7:18 PM Paul Wais  wrote:

> Dear List,
>
> I've observed some sort of memory leak when using pyspark to run ~100
> jobs in local mode.  Each job is essentially a create RDD -> create DF
> -> write DF sort of flow.  The RDD and DFs go out of scope after each
> job completes, hence I call this issue a "memory leak."  Here's
> pseudocode:
>
> ```
> row_rdds = []
> for i in range(100):
>   row_rdd = spark.sparkContext.parallelize([{'a': j} for j in range(1000)])
>   row_rdds.append(row_rdd)
>
> for row_rdd in row_rdds:
>   df = spark.createDataFrame(row_rdd)
>   df.persist()
>   print(df.count())
>   df.write.save(...) # Save parquet
>   df.unpersist()
>
>   # Does not help:
>   # del df
>   # del row_rdd
> ```
>
> In my real application:
>  * rows are much larger, perhaps 1MB each
>  * row_rdds are sized to fit available RAM
>
> I observe that after 100 or so iterations of the second loop (each of
> which creates a "job" in the Spark WebUI), the following happens:
>  * pyspark workers have fairly stable resident and virtual RAM usage
>  * java process eventually approaches resident RAM cap (8GB standard)
> but virtual RAM usage keeps ballooning.
>
> Eventually the machine runs out of RAM and the linux OOM killer kills
> the java process, resulting in an "IndexError: pop from an empty
> deque" error from py4j/java_gateway.py .
>
>
> Does anybody have any ideas about what's going on?  Note that this is
> local mode.  I have personally run standalone masters and submitted a
> ton of jobs and never seen something like this over time.  Those were
> very different jobs, but perhaps this issue is bespoke to local mode?
>
> Emphasis: I did try to del the pyspark objects and run python GC.
> That didn't help at all.
>
> pyspark 2.4.4 on java 1.8 on ubuntu bionic (tensorflow docker image)
>
> 12-core i7 with 16GB of ram and 22GB swap file (swap is *on*).
>
> Cheers,
> -Paul
>