[jira] [Commented] (SPARK-11655) SparkLauncherBackendSuite leaks child processes

2015-11-18 Thread Marcelo Vanzin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15011671#comment-15011671 ]

Marcelo Vanzin commented on SPARK-11655:
----------------------------------------

That test doesn't use the launcher code, so it's most probably a different 
issue. I'll attach the jstack log to this bug just because there isn't a better 
place currently to do that.

> SparkLauncherBackendSuite leaks child processes
> -----------------------------------------------
>
> Key: SPARK-11655
> URL: https://issues.apache.org/jira/browse/SPARK-11655
> Project: Spark
> Issue Type: Bug
> Components: Tests
> Affects Versions: 1.6.0
> Reporter: Josh Rosen
> Assignee: Marcelo Vanzin
> Priority: Blocker
> Fix For: 1.6.0
>
> Attachments: month_of_doom.png, screenshot-1.png, year_or_doom.png
>
>
> We've been combatting an orphaned process issue on AMPLab Jenkins since 
> October and I finally was able to dig in and figure out what's going on.
> After some sleuthing and working around OS limits and JDK bugs, I was able to 
> get the full launch commands for the hanging orphaned processes. It looks 
> like they're all running spark-submit:
> {code}
> org.apache.spark.deploy.SparkSubmit --master local-cluster[1,1,1024] --conf 
> spark.driver.extraClassPath=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/test-classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/launcher/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/common/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/shuffle/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/unsafe/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/tags/target/scala-2.10/
>  -Xms1g -Xmx1g -Dtest.appender=console -XX:MaxPermSize=256m
> {code}
> Based on the output of some Ganglia graphs, I was able to figure out that 
> these leaks started around October 9.
>  !screenshot-1.png|thumbnail! 
> This roughly lines up with when https://github.com/apache/spark/pull/7052 was 
> merged, which added LauncherBackendSuite. The launch arguments used in this 
> suite seem to line up with the arguments that I observe in the hanging 
> processes' {{jps}} output: 
> https://github.com/apache/spark/blame/1bc41125ee6306e627be212969854f639969c440/core/src/test/scala/org/apache/spark/launcher/LauncherBackendSuite.scala#L46
> Interestingly, Jenkins doesn't show test timing or output for this suite! I 
> think that what might be happening is that we have a mixed Scala/Java 
> package, so maybe the two test runner XML files aren't being merged properly: 
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.5-SBT/746/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/org.apache.spark.launcher/
> Whenever I try running this suite locally, it looks like it ends up creating 
> a zombie SparkSubmit process! I think that what's happening is that the 
> launcher's {{handle.kill()}} call ends up destroying the bash 
> {{spark-submit}} subprocess such that its child process (a JVM) leaks.
> I think that we'll have to do something similar to what we do in PySpark when 
> launching a child JVM from a Python / Bash process: connect it to a socket or 
> stream such that it can detect its parent's death and clean up after itself 
> appropriately.
> /cc [~shaneknapp] and [~vanzin].
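
The parent-death detection proposed in the description can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the fix that went into Spark: the child is a stand-in Python process rather than a JVM, and the names `CHILD_SOURCE`, `watch_parent`, `spawn_child` and `demo` are hypothetical. The child blocks on a read of its stdin pipe; when the parent exits or closes its end, the read returns EOF and the child terminates itself.

```python
import subprocess
import sys
import textwrap

# Source of the stand-in child process (hypothetical; a real child would be a JVM).
CHILD_SOURCE = textwrap.dedent("""
    import os
    import sys
    import threading

    def watch_parent():
        # Blocks until the parent closes its end of the pipe or dies (EOF).
        sys.stdin.buffer.read()
        os._exit(0)

    threading.Thread(target=watch_parent, daemon=True).start()
    threading.Event().wait()  # stand-in for the child's long-running work
""")


def spawn_child():
    """Start the child with its stdin connected to a pipe owned by the parent."""
    return subprocess.Popen([sys.executable, "-c", CHILD_SOURCE],
                            stdin=subprocess.PIPE)


def demo():
    """Return (child alive before the pipe is closed, child exit code afterwards)."""
    child = spawn_child()
    alive_before = child.poll() is None
    child.stdin.close()  # simulates the parent going away
    return alive_before, child.wait(timeout=30)


if __name__ == "__main__":
    print(demo())
```

Because the operating system closes the pipe when the parent dies for any reason, the child notices even if an intermediate process is killed without getting a chance to forward a signal.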



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11655) SparkLauncherBackendSuite leaks child processes

2015-11-18 Thread shane knapp (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15011687#comment-15011687 ]

shane knapp commented on SPARK-11655:
-------------------------------------

ah ok, good to know.  i'll kill that process now, if it's cool.




[jira] [Commented] (SPARK-11655) SparkLauncherBackendSuite leaks child processes

2015-11-18 Thread Marcelo Vanzin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15011700#comment-15011700 ]

Marcelo Vanzin commented on SPARK-11655:
----------------------------------------

yeah, it's fine. Also probably good to file a separate bug with the stack.log 
file I just attached.




[jira] [Commented] (SPARK-11655) SparkLauncherBackendSuite leaks child processes

2015-11-18 Thread shane knapp (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15011703#comment-15011703 ]

shane knapp commented on SPARK-11655:
-------------------------------------

sounds good to me.  you can re-close this bug and i'll open a new one.
thanks, marcelo!






[jira] [Commented] (SPARK-11655) SparkLauncherBackendSuite leaks child processes

2015-11-16 Thread shane knapp (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006978#comment-15006978 ]

shane knapp commented on SPARK-11655:
-------------------------------------

just wanted to say that things are definitely looking a LOT better!  i'll keep 
an eye on things this week, but we're definitely out of the woods.

thanks [~joshrosen] and [~vanzin]!




[jira] [Commented] (SPARK-11655) SparkLauncherBackendSuite leaks child processes

2015-11-11 Thread Marcelo Vanzin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000739#comment-15000739 ]

Marcelo Vanzin commented on SPARK-11655:
----------------------------------------

I'll take a look at the code.




[jira] [Commented] (SPARK-11655) SparkLauncherBackendSuite leaks child processes

2015-11-11 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000856#comment-15000856 ]

Apache Spark commented on SPARK-11655:
--------------------------------------

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9633




[jira] [Commented] (SPARK-11655) SparkLauncherBackendSuite leaks child processes

2015-11-11 Thread shane knapp (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000424#comment-15000424 ]

shane knapp commented on SPARK-11655:
-------------------------------------

actually, this started back in mid-may, but the impact definitely ramped up in 
october.   if you look at the attached graph, you'll see the overall process 
count ramp up then.  i was out of the office for most of august, and you can 
see the effect of me not being around to keep an eye on things.  then in 
october, things got REALLY bad (see second attached) and i started killing off 
the hanging processes daily, as evidenced by the sawtooth pattern.

this affects all spark-test builds, btw.

that being said, i like the socket/stream solution...  nice find!
