[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit
[ https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432725#comment-16432725 ]

shane knapp commented on SPARK-8571:

just doing some email archaeology and found this. no, it's not an issue anymore.

> spark streaming hanging processes upon build exit
>
> Key: SPARK-8571
> URL: https://issues.apache.org/jira/browse/SPARK-8571
> Project: Spark
> Issue Type: Bug
> Components: Build, DStreams
> Environment: centos 6.6 amplab build system
> Reporter: shane knapp
> Assignee: shane knapp
> Priority: Minor
> Labels: build, test
>
> over the past 3 months i've been noticing that there are occasionally hanging
> processes on our build system workers after various spark builds have
> finished. these are all spark streaming processes.
> today i noticed a 3+ hour spark build that was timed out after 200 minutes
> (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/2994/),
> and the matrix build hadoop.version=2.0.0-mr1-cdh4.1.2 ran on
> amp-jenkins-worker-02. after the timeout, it left the following process (and
> all of its children) hanging.
> the process' CLI command was:
> {quote}
> [root@amp-jenkins-worker-02 ~]# ps auxwww|grep 1714
> jenkins 1714 733 2.7 21342148 3642740 ? Sl 07:52 1713:41 java
> -Dderby.system.durability=test -Djava.awt.headless=true
> -Djava.io.tmpdir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/tmp
> -Dspark.driver.allowMultipleContexts=true
> -Dspark.test.home=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos
> -Dspark.testing=1 -Dspark.ui.enabled=false
> -Dspark.ui.showConsoleProgress=false
> -Dbasedir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming
> -ea -Xmx3g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m
> org.scalatest.tools.Runner -R
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/classes
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/test-classes
> -o -f
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/SparkTestSuite.txt
> -u
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/.
> {quote}
> stracing that process doesn't give us much:
> {quote}
> [root@amp-jenkins-worker-02 ~]# strace -p 1714
> Process 1714 attached - interrupt to quit
> futex(0x7ff3cdd269d0, FUTEX_WAIT, 1715, NULL
> {quote}
> stracing its children gives us a *little* bit more...
> some loop like this:
> {quote}
>
> futex(0x7ff3c8012d28, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x7ff3c8012f54, FUTEX_WAIT_PRIVATE, 28969, NULL) = 0
> futex(0x7ff3c8012f28, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x7ff3c8f17954, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7ff3c8f17950,
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x7ff3c8f17928, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x7ff3c8012d54, FUTEX_WAIT_BITSET_PRIVATE, 1, {2263862, 865233273},
> ) = -1 ETIMEDOUT (Connection timed out)
> {quote}
> and others loop on ptrace_attach (no such process) or restart_syscall
> (resuming interrupted call)
> even though this behavior has been solidly pinned to jobs timing out (which
> ends w/an aborted, not failed, build), i've seen it happen for failed builds
> as well. if i see any hanging processes from failed (not aborted) builds, i
> will investigate them and update this bug as well.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
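The report above finds the leftover JVM by grepping `ps` output for a known PID. As a rough sketch of how the same check could be made without knowing the PID, the Python below scans a Linux-style `/proc` tree for processes whose command line contains a marker such as `-Dspark.testing=1` (a flag taken from the `ps` output above). The function name `find_processes` and its `proc_root` parameter are hypothetical, introduced only for illustration; this is not something the build system is known to run.

```python
import os


def find_processes(marker, proc_root="/proc"):
    """Return (pid, ppid, cmdline) for every process whose command line contains marker.

    Hypothetical helper: assumes a Linux-style /proc layout under proc_root.
    """
    matches = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                args = f.read().split(b"\0")
            with open(os.path.join(proc_root, entry, "stat"), "rb") as f:
                stat = f.read()
        except OSError:
            continue  # the process exited, or is unreadable, while we were scanning
        cmdline = " ".join(a.decode(errors="replace") for a in args if a)
        if marker not in cmdline:
            continue
        # stat looks like "pid (comm) state ppid ..."; comm may itself contain
        # spaces or parentheses, so split after the last ")".
        ppid = int(stat.rsplit(b")", 1)[1].split()[1])
        matches.append((int(entry), ppid, cmdline))
    return matches
```

On a worker, `find_processes("-Dspark.testing=1")` would list candidate test JVMs; a parent PID of 1 is one hint (not proof) that a process has outlived the build that started it.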
[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit
[ https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16223226#comment-16223226 ]

Xin Lu commented on SPARK-8571:

still an issue?
[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit
[ https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742939#comment-14742939 ]

shane knapp commented on SPARK-8571:

yes, it is. i'll post an update this coming week.
[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit
[ https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14741824#comment-14741824 ]

Tathagata Das commented on SPARK-8571:

Is this still an issue?
[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit
[ https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631467#comment-14631467 ]

Josh Rosen commented on SPARK-8571:

I don't think that bumping up the build timeouts will necessarily help us here. Do you think that the hanging process is somehow detaching itself from its parents' signal handlers (or whatever) such that it never receives the SIGTERM?
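One way to gather evidence for or against the "never receives the SIGTERM" hypothesis is to look at how a lingering process treats that signal. On Linux, `/proc/<pid>/status` exposes the `SigBlk`, `SigIgn` and `SigCgt` hexadecimal bitmasks, in which signal number n is bit n-1. The sketch below parses the text of such a file; the function name `sigterm_disposition` is hypothetical, and nothing here is taken from the Jenkins or Spark code bases.

```python
import signal


def sigterm_disposition(status_text):
    """Classify SIGTERM handling from the text of a /proc/<pid>/status file.

    Hypothetical helper: returns which of the blocked / ignored / caught
    masks have the SIGTERM bit set.
    """
    masks = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("SigBlk", "SigIgn", "SigCgt"):
            masks[key] = int(value.strip(), 16)  # masks are printed as hex
    bit = 1 << (signal.SIGTERM - 1)  # signal n is bit n-1 in these masks
    return {
        "blocked": bool(masks.get("SigBlk", 0) & bit),
        "ignored": bool(masks.get("SigIgn", 0) & bit),
        "caught": bool(masks.get("SigCgt", 0) & bit),
    }
```

For a suspect PID, the input would be the contents of `/proc/<pid>/status`. A process with the "ignored" bit set would survive a SIGTERM even if it were delivered, which is a different failure from the signal never being sent to it at all.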
[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit
[ https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631497#comment-14631497 ]

shane knapp commented on SPARK-8571:

another one:

/usr/java/latest/bin/java -Xmx3g
-Djava.io.tmpdir=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.2/label/centos/target/tmp
-Dspark.test.home=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.2/label/centos
-Dspark.testing=1 -Dspark.port.maxRetries=100 -Dspark.ui.enabled=false
-Dspark.ui.showConsoleProgress=false
-Dspark.driver.allowMultipleContexts=true
-Dspark.unsafe.exceptionOnMemoryLeak=true
-Dsun.io.serialization.extendedDebugInfo=true
-Dderby.system.durability=test -ea -Xmx3g -Xss4096k -XX:PermSize=128M
-XX:MaxNewSize=256m -XX:MaxPermSize=1g

(i removed the classpath for brevity)
[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit
[ https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631493#comment-14631493 ]

shane knapp commented on SPARK-8571:

that's literally the only thing that can be happening here, re process detaching.
[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit
[ https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626946#comment-14626946 ]

shane knapp commented on SPARK-8571:

alright, more updates:

this is still happening, though w/much less frequency.

i discovered a hanging process on amp-jenkins-worker-01, which was the hadoop 2.3 matrix build spawned by Spark-SBT-Master. this particular build timed out after three hours, and was automatically aborted even though it was still running:
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2941/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/

i looked at the jenkins spec for builds being aborted, and didn't get very far:
https://wiki.jenkins-ci.org/display/JENKINS/Aborting+a+build

TL;DR: it uses java.lang.UnixProcess.destroyProcess, which sends a SIGTERM to the builds. somehow this isn't actually killing everything.

one possible solution is to up the timeout by another ~30 minutes, but i don't think that'll necessarily fix the problem.

[~joshrosen] thoughts?

ps- that hanging process is still running on amp-jenkins-worker-01: PID 120943
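A SIGTERM sent to a single PID is not forwarded to that process's children, which is one way "somehow this isn't actually killing everything" can come about. A common mitigation, sketched below in Python, is to start the build step as the leader of its own process group and signal the whole group on abort. This is an illustration under assumptions, not what Jenkins does: `run_in_own_group`, `terminate_group` and `grace_seconds` are hypothetical names, and the code is POSIX-only.

```python
import os
import signal
import subprocess


def run_in_own_group(argv):
    """Start argv as the leader of a new session, and so of a new process group."""
    return subprocess.Popen(argv, start_new_session=True)


def terminate_group(proc, grace_seconds=10.0):
    """SIGTERM proc's whole process group; SIGKILL the group if the leader lingers.

    Returns the leader's exit status (negative signal number if it was killed).
    """
    pgid = proc.pid  # with start_new_session=True the child leads its own group
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return proc.wait()  # the group is already gone; just reap the leader
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)
        return proc.wait()
```

The sketch escalates to SIGKILL only when the group leader itself fails to exit, so a descendant that ignores SIGTERM after its leader has died would still be left behind. A descendant that has moved into its own session or process group is not reached at all, which is consistent with the "process detaching" explanation discussed in this thread.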
[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit
[ https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622924#comment-14622924 ]
shane knapp commented on SPARK-8571:
ok, changes have been made to the build configs... i'll keep an eye on these and make sure they're working as intended. i'll mark this resolved once i'm certain we're in good shape.
[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit
[ https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619098#comment-14619098 ]
shane knapp commented on SPARK-8571:
ok, upon auditing all of the spark builds, i think i found the culprits:
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-Maven-pre-YARN/configure
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.4-Maven-with-YARN/configure
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.4-Maven-pre-YARN/configure
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.3-Maven-with-YARN/configure
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.3-Maven-pre-YARN/configure
* #!/bin/bash is NOT set (defaulting to behavior where if mvn fails, the lsof/xargs kill commands will never run)
* the lsof/xargs kill line will potentially pollute the exit code of the build block
* set -e looks to be impossible to set due to the lsof/xargs kill being the last line of the block
proposed:
* store the retcodes of the mvn commands, and if either one fails, fail the build after the lsof/xargs kill command
* add #!/bin/bash
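The first two points of that audit can be reproduced in isolation. This is a minimal sketch, assuming the shebang-less Jenkins shell step runs with errexit enabled (Jenkins invokes such steps with `sh -xe` by default). `false` and the `echo` are stand-ins chosen for illustration, not the real mvn or lsof/xargs kill commands.

```shell
#!/bin/bash
# Stand-ins (hypothetical): `false` plays a failing mvn step and
# `echo "cleanup ran"` plays the lsof/xargs kill line.
build_block='false; echo "cleanup ran"'

# With errexit (-e) the shell stops at the failing step,
# so the cleanup line is never reached.
status_with_errexit=0
with_errexit=$(bash -e -c "$build_block") || status_with_errexit=$?

# Without errexit the cleanup runs, but the block's exit status is now
# that of its last command, which hides the earlier failure.
status_without_errexit=0
without_errexit=$(bash -c "$build_block") || status_without_errexit=$?

echo "errexit:    output='${with_errexit}' status=${status_with_errexit}"
echo "no errexit: output='${without_errexit}' status=${status_without_errexit}"
```

Neither mode both cleans up and reports the failure, which is why the proposal stores the return codes and checks them after the kill line.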
[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit
[ https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619135#comment-14619135 ]
shane knapp commented on SPARK-8571:
basically the code would look something like:
#!/bin/bash
rm -rf ./work
git clean -fdx
export BLAH
build/mvn BLAH BLAH
retcode1=$?
build/mvn WHEE ZOMG
retcode2=$?
lsof | xargs kill
if [[ $retcode1 -ne 0 || $retcode2 -ne 0 ]]; then
  exit 1
fi
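The same store-the-retcodes idea can also be written as a small reusable function. This is only a sketch and not the change that was made to the build configs. `run_steps_then_cleanup` is a hypothetical name, and passing each command as a single string to `bash -c` is an assumption made to keep the example self-contained.

```shell
#!/bin/bash
# Hypothetical helper: run every build step, always run the cleanup
# command afterwards, then report failure if any step failed.
# Usage: run_steps_then_cleanup CLEANUP_CMD STEP_CMD...
run_steps_then_cleanup() {
    local cleanup_cmd
    local failed
    local step
    cleanup_cmd="$1"
    shift
    failed=0
    for step in "$@"; do
        # Record a failure but keep going, so cleanup is always reached.
        bash -c "$step" || failed=1
    done
    # Ignore the cleanup command's own exit status so it cannot
    # pollute the result of the build block.
    bash -c "$cleanup_cmd" || true
    return "$failed"
}
```

A build step might then call it as `run_steps_then_cleanup 'CLEANUP' 'build/mvn BLAH BLAH' 'build/mvn WHEE ZOMG' || exit 1`, where `CLEANUP` stands for the lsof/xargs kill line and the mvn arguments are the same placeholders used in the comment above.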