[jira] [Updated] (SPARK-25144) distinct on Dataset leads to exception due to Managed memory leak detected

2018-08-20 Thread Hyukjin Kwon (JIRA)


 [ https://issues.apache.org/jira/browse/SPARK-25144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-25144:
---------------------------------
    Fix Version/s: 2.3.2
                   2.2.3

> distinct on Dataset leads to exception due to Managed memory leak detected
> ---------------------------------------------------------------------------
>
>                  Key: SPARK-25144
>                  URL: https://issues.apache.org/jira/browse/SPARK-25144
>              Project: Spark
>           Issue Type: Bug
>           Components: SQL
>     Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.3.2
>             Reporter: Ayoub Benali
>             Assignee: Dongjoon Hyun
>             Priority: Major
>              Fix For: 2.2.3, 2.3.2
>
>
> The following code example: 
> {code}
> case class Foo(bar: Option[String])
> val ds = List(Foo(Some("bar"))).toDS
> val result = ds.flatMap(_.bar).distinct
> result.rdd.isEmpty
> {code}
> produces the following stacktrace:
> {code}
> [info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 42 in stage 7.0 failed 1 times, most recent failure: Lost task 42.0 in stage 7.0 (TID 125, localhost, executor driver): org.apache.spark.SparkException: Managed memory leak detected; size = 16777216 bytes, TID = 125
> [info]   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358)
> [info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]   at java.lang.Thread.run(Thread.java:748)
> [info]
> [info] Driver stacktrace:
> [info]   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
> [info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
> [info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
> [info]   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> [info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> [info]   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
> [info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> [info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> [info]   at scala.Option.foreach(Option.scala:257)
> [info]   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
> [info]   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
> [info]   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
> [info]   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
> [info]   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> [info]   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
> [info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358)
> [info]   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> [info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> [info]   at org.apache.spark.rdd.RDD.take(RDD.scala:1331)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1466)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466)
> [info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466)
> [info]   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> [info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> [info]   at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1465)
> {code}
> The code example doesn't produce any error when the `distinct` function is not called.
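>
> A possible workaround (an illustrative sketch added here, not part of the original report): keep the emptiness check in the Dataset API instead of going through `result.rdd`; the RDD conversion's `take(1)` path is what shows up in the stack trace above.
> {code}
> // Hypothetical workaround sketch: count a one-row limit rather than
> // calling result.rdd.isEmpty, which goes through RDD.take(1).
> val isEmpty = result.limit(1).count() == 0
> {code}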



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25144) distinct on Dataset leads to exception due to Managed memory leak detected

2018-08-20 Thread Saisai Shao (JIRA)


 [ https://issues.apache.org/jira/browse/SPARK-25144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saisai Shao updated SPARK-25144:
--------------------------------
    Target Version/s: 2.2.3, 2.3.2

> distinct on Dataset leads to exception due to Managed memory leak detected
> ---------------------------------------------------------------------------
>
>                  Key: SPARK-25144
>                  URL: https://issues.apache.org/jira/browse/SPARK-25144
>              Project: Spark
>           Issue Type: Bug
>           Components: SQL
>     Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.3.2
>             Reporter: Ayoub Benali
>             Priority: Major



[jira] [Updated] (SPARK-25144) distinct on Dataset leads to exception due to Managed memory leak detected

2018-08-20 Thread Dongjoon Hyun (JIRA)


 [ https://issues.apache.org/jira/browse/SPARK-25144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-25144:
----------------------------------
    Affects Version/s: 2.0.2

> distinct on Dataset leads to exception due to Managed memory leak detected
> ---------------------------------------------------------------------------
>
>                  Key: SPARK-25144
>                  URL: https://issues.apache.org/jira/browse/SPARK-25144
>              Project: Spark
>           Issue Type: Bug
>           Components: SQL
>     Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.3.2
>             Reporter: Ayoub Benali
>             Priority: Major



[jira] [Updated] (SPARK-25144) distinct on Dataset leads to exception due to Managed memory leak detected

2018-08-20 Thread Dongjoon Hyun (JIRA)


 [ https://issues.apache.org/jira/browse/SPARK-25144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-25144:
----------------------------------
    Environment: (was: spark 2.3.1)

> distinct on Dataset leads to exception due to Managed memory leak detected
> ---------------------------------------------------------------------------
>
>                  Key: SPARK-25144
>                  URL: https://issues.apache.org/jira/browse/SPARK-25144
>              Project: Spark
>           Issue Type: Bug
>           Components: SQL
>     Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.3.2
>             Reporter: Ayoub Benali
>             Priority: Major



[jira] [Updated] (SPARK-25144) distinct on Dataset leads to exception due to Managed memory leak detected

2018-08-20 Thread Dongjoon Hyun (JIRA)


 [ https://issues.apache.org/jira/browse/SPARK-25144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-25144:
----------------------------------
    Affects Version/s: 2.3.2
                       2.1.3
                       2.2.2

> distinct on Dataset leads to exception due to Managed memory leak detected
> ---------------------------------------------------------------------------
>
>                  Key: SPARK-25144
>                  URL: https://issues.apache.org/jira/browse/SPARK-25144
>              Project: Spark
>           Issue Type: Bug
>           Components: SQL
>     Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.3.2
>          Environment: spark 2.3.1
>             Reporter: Ayoub Benali
>             Priority: Major



[jira] [Updated] (SPARK-25144) distinct on Dataset leads to exception due to Managed memory leak detected

2018-08-20 Thread Dongjoon Hyun (JIRA)


 [ https://issues.apache.org/jira/browse/SPARK-25144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-25144:
----------------------------------
    Component/s: (was: Spark Core)
                 (was: Optimizer)

> distinct on Dataset leads to exception due to Managed memory leak detected
> ---------------------------------------------------------------------------
>
>                  Key: SPARK-25144
>                  URL: https://issues.apache.org/jira/browse/SPARK-25144
>              Project: Spark
>           Issue Type: Bug
>           Components: SQL
>     Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.3.2
>          Environment: spark 2.3.1
>             Reporter: Ayoub Benali
>             Priority: Major



[jira] [Updated] (SPARK-25144) distinct on Dataset leads to exception due to Managed memory leak detected

2018-08-17 Thread Ayoub Benali (JIRA)


 [ https://issues.apache.org/jira/browse/SPARK-25144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ayoub Benali updated SPARK-25144:
---------------------------------
Description: 
The following code example: 
{code}
case class Foo(bar: Option[String])
val ds = List(Foo(Some("bar"))).toDS
val result = ds.flatMap(_.bar).distinct
result.rdd.isEmpty
{code}
produces the following stacktrace:
{code}
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 42 in stage 7.0 failed 1 times, most recent failure: Lost task 42.0 in stage 7.0 (TID 125, localhost, executor driver): org.apache.spark.SparkException: Managed memory leak detected; size = 16777216 bytes, TID = 125
[info]  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358)
[info]  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[info]  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[info]  at java.lang.Thread.run(Thread.java:748)
[info] 
[info] Driver stacktrace:
[info]   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
[info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
[info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
[info]   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
[info]   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
[info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
[info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
[info]   at scala.Option.foreach(Option.scala:257)
[info]   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
[info]   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
[info]   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
[info]   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
[info]   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
[info]   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
[info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
[info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
[info]   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
[info]   at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1358)
[info]   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[info]   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
[info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
[info]   at org.apache.spark.rdd.RDD.take(RDD.scala:1331)
[info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1466)
[info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466)
[info]   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1466)
[info]   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[info]   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
[info]   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
[info]   at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1465)
{code}
The code example doesn't produce any error when the `distinct` function is not called.

  was:
The following code example: 
{code:scala}
case class Foo(bar: Option[String])
val ds = List(Foo(Some("bar"))).toDS
val result = ds.flatMap(_.bar).distinct
result.rdd.isEmpty
{code}

Produces the following stacktrace

{code:bash}
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 42 in stage 7.0 failed 1 times, most recent failure: Lost task 42.0 in stage 7.0 (TID 125, localhost, executor driver): org.apache.spark.SparkException: Managed memory leak detected; size = 16777216 bytes, TID = 125
[info]  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:358)
[info]  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[info]  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[info]  at java.lang.Thread.run(Thread.java:748)
[info] 
[info] Driver stacktrace:
[info]   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
[info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
[info]   at