[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit

2018-04-10 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432725#comment-16432725
 ] 

shane knapp commented on SPARK-8571:


just doing some email archaeology and found this.

no, it's not an issue anymore.

 

> spark streaming hanging processes upon build exit
> -
>
> Key: SPARK-8571
> URL: https://issues.apache.org/jira/browse/SPARK-8571
> Project: Spark
>  Issue Type: Bug
>  Components: Build, DStreams
> Environment: centos 6.6 amplab build system
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Minor
>  Labels: build, test
>
> over the past 3 months i've been noticing that there are occasionally hanging 
> processes on our build system workers after various spark builds have 
> finished.  these are all spark streaming processes.
> today i noticed a 3+ hour spark build that was timed out after 200 minutes 
> (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/2994/),
>  and the matrix build hadoop.version=2.0.0-mr1-cdh4.1.2 ran on 
> amp-jenkins-worker-02.  after the timeout, it left the following process (and 
> all of its children) hanging.
> the process' CLI command was:
> {quote}
> [root@amp-jenkins-worker-02 ~]# ps auxwww|grep 1714
> jenkins   1714  733  2.7 21342148 3642740 ?  Sl   07:52 1713:41 java 
> -Dderby.system.durability=test -Djava.awt.headless=true 
> -Djava.io.tmpdir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/tmp
>  -Dspark.driver.allowMultipleContexts=true 
> -Dspark.test.home=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos
>  -Dspark.testing=1 -Dspark.ui.enabled=false 
> -Dspark.ui.showConsoleProgress=false 
> -Dbasedir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming
>  -ea -Xmx3g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m 
> org.scalatest.tools.Runner -R 
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/classes
>  
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/test-classes
>  -o -f 
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/SparkTestSuite.txt
>  -u 
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/.
> {quote}
> stracing that process doesn't give us much:
> {quote}
> [root@amp-jenkins-worker-02 ~]# strace -p 1714
> Process 1714 attached - interrupt to quit
> futex(0x7ff3cdd269d0, FUTEX_WAIT, 1715, NULL
> {quote}
> stracing its children gives us a *little* bit more...  some loop like this:
> {quote}
> 
> futex(0x7ff3c8012d28, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x7ff3c8012f54, FUTEX_WAIT_PRIVATE, 28969, NULL) = 0
> futex(0x7ff3c8012f28, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x7ff3c8f17954, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7ff3c8f17950, 
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x7ff3c8f17928, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x7ff3c8012d54, FUTEX_WAIT_BITSET_PRIVATE, 1, {2263862, 865233273}, 
> ) = -1 ETIMEDOUT (Connection timed out)
> {quote}
> and others loop on ptrace_attach (no such process) or restart_syscall 
> (resuming interrupted call)
> even though this behavior has been solidly pinned to jobs timing out (which 
> ends w/an aborted, not failed, build), i've seen it happen for failed builds 
> as well.  if i see any hanging processes from failed (not aborted) builds, i 
> will investigate them and update this bug as well.
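For spotting processes like the one described above on a worker, here is a minimal sketch. It is not the tooling used on the amplab workers. It assumes the input is the text output of `ps -eo pid,args` and treats the `-Dspark.testing=1` flag from the command line above as the marker for a test JVM. The function name and the sample rows are hypothetical.

```python
def find_leftover_test_jvms(ps_lines):
    """Return (pid, command) pairs for java processes started with -Dspark.testing=1.

    ps_lines is assumed to be the output of `ps -eo pid,args`, one line per
    process, optionally starting with the "PID COMMAND" header row.
    """
    matches = []
    for line in ps_lines:
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip the header row and malformed lines
        pid, command = int(parts[0]), parts[1]
        executable = command.split()[0]
        if executable.endswith("java") and "-Dspark.testing=1" in command:
            matches.append((pid, command))
    return matches


# Hypothetical sample rows, abbreviated for readability.
SAMPLE_PS_OUTPUT = [
    "  PID COMMAND",
    " 1714 java -Dderby.system.durability=test -Dspark.testing=1 org.scalatest.tools.Runner",
    " 2201 /usr/sbin/sshd -D",
]

if __name__ == "__main__":
    for pid, command in find_leftover_test_jvms(SAMPLE_PS_OUTPUT):
        print(pid, command)
```

Matching on a flag rather than on the PID keeps the check independent of any one build. A real check would also need to confirm that the owning build has already finished.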



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit

2017-10-27 Thread Xin Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16223226#comment-16223226
 ] 

Xin Lu commented on SPARK-8571:
---

still an issue?




[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit

2015-09-13 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14742939#comment-14742939
 ] 

shane knapp commented on SPARK-8571:


yes, it is.  i'll post an update this coming week.






[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit

2015-09-11 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741824#comment-14741824
 ] 

Tathagata Das commented on SPARK-8571:
--

Is this still an issue?




[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit

2015-07-17 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631467#comment-14631467
 ] 

Josh Rosen commented on SPARK-8571:
---

I don't think that bumping up the build timeouts will necessarily help us here.

Do you think that the hanging process is somehow detaching itself from its 
parents' signal handlers (or whatever) such that it never receives the SIGTERM?
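One way a child could end up out of reach is sketched below. This is an illustration of the general mechanism only, not a confirmed diagnosis of the Spark builds. It assumes a POSIX system and shows that a child started in its own session no longer shares the parent's process group, so a signal aimed at the parent's group would not be delivered to it. The helper names are hypothetical.

```python
import os
import subprocess
import sys


def spawn(detached):
    """Start a sleeping child; if detached, give it its own session and process group."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"],
        start_new_session=detached,  # calls setsid() in the child before exec
    )


def shares_our_process_group(child):
    """True if the child is still in the same process group as this process."""
    return os.getpgid(child.pid) == os.getpgrp()


def demo():
    """Return (attached_in_group, detached_in_group) and clean up both children."""
    attached = spawn(detached=False)
    detached = spawn(detached=True)
    try:
        return shares_our_process_group(attached), shares_our_process_group(detached)
    finally:
        for child in (attached, detached):
            child.kill()
            child.wait()


if __name__ == "__main__":
    print(demo())
```

A signal sent directly to a PID still reaches that one process regardless of its group, so this only matters when the kill targets a group, or when it targets a parent and relies on the parent to pass the signal on.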




[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit

2015-07-17 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631497#comment-14631497
 ] 

shane knapp commented on SPARK-8571:


another one:
/usr/java/latest/bin/java -Xmx3g 
-Djava.io.tmpdir=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.2/label/centos/target/tmp
 
-Dspark.test.home=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.2/label/centos
 -Dspark.testing=1 -Dspark.port.maxRetries=100 -Dspark.ui.enabled=false 
-Dspark.ui.showConsoleProgress=false -Dspark.driver.allowMultipleContexts=true 
-Dspark.unsafe.exceptionOnMemoryLeak=true 
-Dsun.io.serialization.extendedDebugInfo=true -Dderby.system.durability=test 
-ea -Xmx3g -Xss4096k -XX:PermSize=128M -XX:MaxNewSize=256m -XX:MaxPermSize=1g

(i removed the classpath for brevity)




[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit

2015-07-17 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631493#comment-14631493
 ] 

shane knapp commented on SPARK-8571:


that's literally the only thing that can be happening here, re process 
detaching.




[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit

2015-07-14 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626946#comment-14626946
 ] 

shane knapp commented on SPARK-8571:


alright, more updates:

this is still happening, though w/much less frequency.  i discovered a hanging 
process on amp-jenkins-worker-01, which was the hadoop 2.3 matrix build spawned 
by Spark-Master-SBT.  this particular build timed out after three hours, and 
automatically aborted even though it was still running:  
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2941/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=centos/

i looked at the jenkins spec for builds being aborted, and didn't get very far: 
 https://wiki.jenkins-ci.org/display/JENKINS/Aborting+a+build

TL;DR:  it uses java.lang.UNIXProcess.destroyProcess, which sends a SIGTERM to 
the builds.  somehow this isn't actually killing everything.
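A common workaround when SIGTERM alone leaves processes behind is to signal the whole process group and escalate to SIGKILL after a grace period. The sketch below is a generic POSIX illustration of that technique, not what Jenkins does and not the change made to the build configs. The function name and the grace period are hypothetical, and it assumes the build was started in its own session so that its PID doubles as its process-group ID.

```python
import os
import signal
import subprocess
import sys
import time


def stop_process_group(proc, grace_seconds=5.0):
    """Send SIGTERM to proc's process group, then SIGKILL if it is still alive.

    Assumes proc was started with start_new_session=True. Returns the exit
    status reported by Popen.wait (negative signal number if killed).
    """
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)  # SIGKILL cannot be caught or ignored
        return proc.wait()


if __name__ == "__main__":
    # Hypothetical stand-in for a stuck build: a child that ignores SIGTERM.
    stubborn = subprocess.Popen(
        [sys.executable, "-c",
         "import signal, time; signal.signal(signal.SIGTERM, signal.SIG_IGN); time.sleep(60)"],
        start_new_session=True,
    )
    time.sleep(0.5)  # give the child time to install its handler
    print(stop_process_group(stubborn, grace_seconds=1.0))
```

Signalling the group rather than a single PID also reaches children of the build, as long as they have not moved themselves into a different group or session.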

one possible solution is to up the timeout by another ~30 minutes, but i don't 
think that'll necessarily fix the problem.  [~joshrosen] thoughts?

ps- that hanging process is still running on amp-jenkins-worker-01:  PID 120943


[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit

2015-07-10 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622924#comment-14622924
 ] 

shane knapp commented on SPARK-8571:


ok, changes have been made to the build configs...  i'll keep an eye on these 
and make sure they're working as intended.  i'll mark this resolved once i'm 
certain we're in good shape.




[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit

2015-07-08 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619098#comment-14619098
 ] 

shane knapp commented on SPARK-8571:


ok, upon auditing all of the spark builds, i think i found the culprits:

https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-Maven-pre-YARN/configure
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.4-Maven-with-YARN/configure
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.4-Maven-pre-YARN/configure
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.3-Maven-with-YARN/configure
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.3-Maven-pre-YARN/configure

* #!/bin/bash is NOT set, so the block runs with the default shell behavior, 
where a failing mvn ends the script immediately and the lsof/xargs kill 
commands never run
* the lsof/xargs kill line is the last command in the block, so its exit code 
can pollute the exit code of the whole build block
* set -e doesn't look usable either, since the lsof/xargs kill is the last 
line of the block and a mvn failure would exit before reaching it

proposed:
* store the retcodes of the mvn commands, and if either one fails, fail the 
build after the lsof/xargs kill command
* add #!/bin/bash
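
The difference between the two behaviors described above can be reproduced outside Jenkins. The sketch below is only an illustration under stated assumptions: `false` stands in for a failing mvn step, the `echo` stands in for the lsof/xargs kill cleanup, and `sh -e` approximates the default Jenkins "Execute shell" behavior when no shebang is given. None of the variable names come from the actual job configs.

```shell
#!/bin/bash
# Hypothetical stand-ins: "false" plays the role of a failing mvn step,
# and the echo plays the role of the lsof/xargs kill cleanup.

# Case 1: the step runs under "sh -e". The shell exits at the first
# failure, so the cleanup line is never reached.
status_with_e=0
out_with_e=$(sh -e -c 'false; echo "cleanup ran"') || status_with_e=$?

# Case 2: no -e. The failure is recorded, cleanup runs, and the recorded
# return code is used to fail the block afterwards.
status_without_e=0
out_without_e=$(sh -c 'false; rc=$?; echo "cleanup ran"; exit $rc') || status_without_e=$?

echo "with -e:    output='${out_with_e}' status=${status_with_e}"
echo "without -e: output='${out_without_e}' status=${status_without_e}"
```

Both cases end with a non-zero status, but only the second one reaches the cleanup line, which is the property the proposal relies on.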




[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit

2015-07-08 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619135#comment-14619135
 ] 

shane knapp commented on SPARK-8571:


basically the code would look something like:

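#!/bin/bash
# explicit shebang, so a failing mvn doesn't end the script before cleanup
rm -rf ./work
git clean -fdx

export BLAH
# run both mvn steps, saving each return code instead of exiting on failure
build/mvn BLAH BLAH
retcode1=$?
build/mvn WHEE ZOMG
retcode2=$?

# cleanup always runs (shorthand for the job's existing lsof/xargs kill line)
lsof | xargs kill

# fail the build only after cleanup, if either mvn step failed
if [[ $retcode1 -ne 0 || $retcode2 -ne 0 ]]; then
  exit 1
fi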
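
The same control flow can be tried locally with stand-in functions. This is a minimal sketch, not the real job config: `first_mvn_step`, `second_mvn_step`, `kill_leftovers` and `run_build_block` are hypothetical names, and the second step is hard-coded to fail so the failure path is exercised.

```shell
#!/bin/bash
# Hypothetical stand-ins for the real job steps; none of these names come
# from the actual Jenkins configs.
first_mvn_step()  { return 0; }   # pretend the first mvn invocation passes
second_mvn_step() { return 3; }   # pretend the second mvn invocation fails
cleanup_ran=no
kill_leftovers()  { cleanup_ran=yes; }  # stands in for the lsof/xargs kill line

run_build_block() {
  local retcode1=0 retcode2=0
  first_mvn_step  || retcode1=$?
  second_mvn_step || retcode2=$?

  # Cleanup runs regardless of the mvn results; "|| true" keeps its own
  # exit status from leaking into the block's result.
  kill_leftovers || true

  # Only now report failure, based on the saved return codes.
  if [[ $retcode1 -ne 0 || $retcode2 -ne 0 ]]; then
    return 1
  fi
  return 0
}

block_status=0
run_build_block || block_status=$?
echo "cleanup_ran=${cleanup_ran} block_status=${block_status}"
```

With the second step failing, the cleanup still runs and the block reports failure afterwards, so the cleanup's own exit status never decides the build result.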
