[jira] [Commented] (SPARK-30443) "Managed memory leak detected" even with no calls to take() or limit()

2020-04-01 Thread Ben Roling (Jira)


[ https://issues.apache.org/jira/browse/SPARK-30443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072857#comment-17072857 ]

Ben Roling commented on SPARK-30443:


Thanks [~xiaojuwu].  Running explain() on the test program [~ltrichter] 
provided does show a SortMergeJoin, and the same code does not exhibit the 
issue in Spark 2.4.5.  It does seem likely that SPARK-21492 is the cause.
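
For reference, a minimal sketch of that check, reusing the left join from the 
test program quoted below (assuming the attached CSV files are in the working 
directory):

{code:python}
import pyspark.sql

spark = pyspark.sql.SparkSession.builder.appName("spark-jira-explain").getOrCreate()

reader = spark.read.format("csv").option("inferSchema", "true").option("header", "true")
table_c = reader.load("c.csv")
new_data = reader.load("b.csv").select("some_id")

# explain() prints the physical plan; on the affected versions the left
# join below should show up as a SortMergeJoin node.
result = table_c.join(new_data, "some_id", "left")
result.explain()

spark.stop()
{code}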

> "Managed memory leak detected" even with no calls to take() or limit()
> --
>
> Key: SPARK-30443
> URL: https://issues.apache.org/jira/browse/SPARK-30443
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.4, 3.0.0
>Reporter: Luke Richter
>Priority: Major
> Attachments: a.csv.zip, b.csv.zip, c.csv.zip
>
>
> Our Spark code is causing a "Managed memory leak detected" warning to appear, 
> even though we are not calling take() or limit().
> According to SPARK-14168 (https://issues.apache.org/jira/browse/SPARK-14168), 
> managed memory leaks should only be caused by not reading an iterator to 
> completion, e.g. by calling take() or limit().
> Our exact warning text is: "2020-01-06 14:54:59 WARN Executor:66 - Managed 
> memory leak detected; size = 2097152 bytes, TID = 118"
> The size of the managed memory leak is always 2 MB.
> I have created a minimal test program that reproduces the warning: 
> {code:python}
> import pyspark.sql
> import pyspark.sql.functions as fx
>
> def main():
>     builder = pyspark.sql.SparkSession.builder
>     builder = builder.appName("spark-jira")
>     spark = builder.getOrCreate()
>     reader = spark.read
>     reader = reader.format("csv")
>     reader = reader.option("inferSchema", "true")
>     reader = reader.option("header", "true")
>     table_c = reader.load("c.csv")
>     table_a = reader.load("a.csv")
>     table_b = reader.load("b.csv")
>     primary_filter = fx.col("some_code").isNull()
>     new_primary_data = table_a.filter(primary_filter)
>     new_ids = new_primary_data.select("some_id")
>     new_data = table_b.join(new_ids, "some_id")
>     new_data = new_data.select("some_id")
>     result = table_c.join(new_data, "some_id", "left")
>     result.repartition(1).write.json("results.json", mode="overwrite")
>     spark.stop()
>
> if __name__ == "__main__":
>     main()
> {code}
> Our code isn't anything out of the ordinary, just some filters, selects and 
> joins.
> The input data is made up of 3 CSV files, roughly 2.6 GB in total 
> uncompressed. I attempted to reduce the number of rows in the CSV input 
> files, but that caused the warning to no longer appear. After compressing 
> the files I was able to attach them below.
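
For context on the SPARK-14168 pattern referenced above, a minimal sketch of 
an iterator that is not read to completion (the data here is synthetic and 
illustrative; whether the warning actually fires depends on the Spark 
version):

{code:python}
import pyspark.sql

spark = pyspark.sql.SparkSession.builder.appName("leak-sketch").getOrCreate()

df = spark.range(10_000_000).selectExpr("id", "id % 100 AS key")

# The sort acquires managed memory per task; limit(1) stops reading the
# sorted iterator after one row, so that memory is released by the
# task-completion cleanup rather than by the operator itself.
df.sort("key").limit(1).collect()

spark.stop()
{code}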



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30443) "Managed memory leak detected" even with no calls to take() or limit()

2020-03-26 Thread Xiaoju Wu (Jira)


[ https://issues.apache.org/jira/browse/SPARK-30443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068290#comment-17068290 ]

Xiaoju Wu commented on SPARK-30443:
---

I also see this kind of warning log. SPARK-21492 may be related to this 
warning. Does your code base contain that change?
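
If it helps to answer that question, a quick first check is which Spark build 
the job actually runs on, since SPARK-21492 can only be present in builds 
that include that change:

{code:python}
import pyspark.sql

spark = pyspark.sql.SparkSession.builder.getOrCreate()
# spark.version reports the Spark build backing this session.
print(spark.version)
{code}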

> "Managed memory leak detected" even with no calls to take() or limit()
> --
>
> Key: SPARK-30443
> URL: https://issues.apache.org/jira/browse/SPARK-30443
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.4, 3.0.0
>Reporter: Luke Richter
>Priority: Major
> Attachments: a.csv.zip, b.csv.zip, c.csv.zip
>
>
> Our Spark code is causing a "Managed memory leak detected" warning to appear, 
> even though we are not calling take() or limit().
> According to SPARK-14168 https://issues.apache.org/jira/browse/SPARK-14168 
> managed memory leaks should only be caused by not reading an iterator to 
> completion, i.e. take() or limit()
> Our exact warning text is: "2020-01-06 14:54:59 WARN Executor:66 - Managed 
> memory leak detected; size = 2097152 bytes, TID = 118"
>  The size of the managed memory leak is always 2MB.
> I have created a minimal test program that reproduces the warning: 
> {code:java}
> import pyspark.sql
> import pyspark.sql.functions as fx
> def main():
> builder = pyspark.sql.SparkSession.builder
> builder = builder.appName("spark-jira")
> spark = builder.getOrCreate()
> reader = spark.read
> reader = reader.format("csv")
> reader = reader.option("inferSchema", "true")
> reader = reader.option("header", "true")
> table_c = reader.load("c.csv")
> table_a = reader.load("a.csv")
> table_b = reader.load("b.csv")
> primary_filter = fx.col("some_code").isNull()
> new_primary_data = table_a.filter(primary_filter)
> new_ids = new_primary_data.select("some_id")
> new_data = table_b.join(new_ids, "some_id")
> new_data = new_data.select("some_id")
> result = table_c.join(new_data, "some_id", "left")
> result.repartition(1).write.json("results.json", mode="overwrite")
> spark.stop()
> if __name__ == "__main__":
> main()
> {code}
> Our code isn't anything out of the ordinary, just some filters, selects and 
> joins.
> The input data is made up of 3 CSV files. The input data files are quite 
> large, roughly 2.6GB in total uncompressed. I attempted to reduce the number 
> of rows in the CSV input files but this caused the warning to no longer 
> appear. After compressing the files I was able to attach them below.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org