[jira] [Assigned] (SPARK-11655) SparkLauncherBackendSuite leaks child processes

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11655:


Assignee: Apache Spark

> SparkLauncherBackendSuite leaks child processes
> ---
>
> Key: SPARK-11655
> URL: https://issues.apache.org/jira/browse/SPARK-11655
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Blocker
> Attachments: month_of_doom.png, screenshot-1.png, year_or_doom.png
>
>
> We've been combatting an orphaned process issue on AMPLab Jenkins since 
> October and I finally was able to dig in and figure out what's going on.
> After some sleuthing and working around OS limits and JDK bugs, I was able to 
> get the full launch commands for the hanging orphaned processes. It looks 
> like they're all running spark-submit:
> {code}
> org.apache.spark.deploy.SparkSubmit --master local-cluster[1,1,1024] --conf 
> spark.driver.extraClassPath=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/test-classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/launcher/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/common/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/shuffle/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/unsafe/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/tags/target/scala-2.10/
>  -Xms1g -Xmx1g -Dtest.appender=console -XX:MaxPermSize=256m
> {code}
> Based on the output of some Ganglia graphs, I was able to figure out that 
> these leaks started around October 9.
>  !screenshot-1.png|thumbnail! 
> This roughly lines up with when https://github.com/apache/spark/pull/7052 was 
> merged, which added LauncherBackendSuite. The launch arguments used in this 
> suite seem to line up with the arguments that I observe in the hanging 
> processes' {{jps}} output: 
> https://github.com/apache/spark/blame/1bc41125ee6306e627be212969854f639969c440/core/src/test/scala/org/apache/spark/launcher/LauncherBackendSuite.scala#L46
> Interestingly, Jenkins doesn't show test timing or output for this suite! I 
> think that what might be happening is that we have a mixed Scala/Java 
> package, so maybe the two test runner XML files aren't being merged properly: 
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.5-SBT/746/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/org.apache.spark.launcher/
> Whenever I try running this suite locally, it looks like it ends up creating 
> a zombie SparkSubmit process! I think that what's happening is that the 
> launcher's {{handle.kill()}} call ends up destroying the bash 
> {{spark-submit}} subprocess such that its child process (a JVM) leaks.
> I think that we'll have to do something similar to what we do in PySpark when 
> launching a child JVM from a Python / Bash process: connect it to a socket or 
> stream such that it can detect its parent's death and clean up after itself 
> appropriately.
> /cc [~shaneknapp] and [~vanzin].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11655) SparkLauncherBackendSuite leaks child processes

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11655:


Assignee: (was: Apache Spark)

> SparkLauncherBackendSuite leaks child processes
> ---
>
> Key: SPARK-11655
> URL: https://issues.apache.org/jira/browse/SPARK-11655
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Priority: Blocker
> Attachments: month_of_doom.png, screenshot-1.png, year_or_doom.png
>
>
> We've been combatting an orphaned process issue on AMPLab Jenkins since 
> October and I finally was able to dig in and figure out what's going on.
> After some sleuthing and working around OS limits and JDK bugs, I was able to 
> get the full launch commands for the hanging orphaned processes. It looks 
> like they're all running spark-submit:
> {code}
> org.apache.spark.deploy.SparkSubmit --master local-cluster[1,1,1024] --conf 
> spark.driver.extraClassPath=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/test-classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/launcher/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/common/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/shuffle/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/unsafe/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/tags/target/scala-2.10/
>  -Xms1g -Xmx1g -Dtest.appender=console -XX:MaxPermSize=256m
> {code}
> Based on the output of some Ganglia graphs, I was able to figure out that 
> these leaks started around October 9.
>  !screenshot-1.png|thumbnail! 
> This roughly lines up with when https://github.com/apache/spark/pull/7052 was 
> merged, which added LauncherBackendSuite. The launch arguments used in this 
> suite seem to line up with the arguments that I observe in the hanging 
> processes' {{jps}} output: 
> https://github.com/apache/spark/blame/1bc41125ee6306e627be212969854f639969c440/core/src/test/scala/org/apache/spark/launcher/LauncherBackendSuite.scala#L46
> Interestingly, Jenkins doesn't show test timing or output for this suite! I 
> think that what might be happening is that we have a mixed Scala/Java 
> package, so maybe the two test runner XML files aren't being merged properly: 
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.5-SBT/746/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/org.apache.spark.launcher/
> Whenever I try running this suite locally, it looks like it ends up creating 
> a zombie SparkSubmit process! I think that what's happening is that the 
> launcher's {{handle.kill()}} call ends up destroying the bash 
> {{spark-submit}} subprocess such that its child process (a JVM) leaks.
> I think that we'll have to do something similar to what we do in PySpark when 
> launching a child JVM from a Python / Bash process: connect it to a socket or 
> stream such that it can detect its parent's death and clean up after itself 
> appropriately.
> /cc [~shaneknapp] and [~vanzin].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org