[JIRA] [remoting] (JENKINS-26947) Unattended wait in the remoting code
Yoann Dubreuil commented on JENKINS-26947 Unattended wait in the remoting code Just created a PR: https://github.com/jenkinsci/maven-plugin/pull/39 This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira -- You received this message because you are subscribed to the Google Groups Jenkins Issues group. To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-issues+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
[JIRA] [remoting] (JENKINS-26947) Unattended wait in the remoting code
James Nord commented on JENKINS-26947 Unattended wait in the remoting code FWIW the original report has nothing to do with packet corruption - just the channel dying. You can get the same results with a "kill -9" on the slave.
[JIRA] [remoting] (JENKINS-26947) Unattended wait in the remoting code
Yoann Dubreuil commented on JENKINS-26947 Unattended wait in the remoting code Yes, that's right. I found the problem when playing with netem, hence the bug report. It's a bug in the Maven plugin: when the upstream channel is closed, the Maven channel stays around. Will post a PR shortly.
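The behaviour described in this comment (a dependent channel outliving its upstream channel) can be sketched in isolation. This is only an illustration under stated assumptions, not the change in the PR and not the hudson.remoting.Channel API: SimpleChannel, onClose, and openChild are hypothetical names made up for the example. It shows one way a child channel can be closed automatically when its upstream channel goes away.

```java
import java.util.ArrayList;
import java.util.List;

public class ChannelCloseDemo {
    /** Hypothetical, simplified channel; NOT the hudson.remoting.Channel API. */
    static class SimpleChannel {
        private final String name;
        private final List<Runnable> closeListeners = new ArrayList<>();
        private boolean closed;

        SimpleChannel(String name) { this.name = name; }

        /** Registers a callback that runs once when this channel closes. */
        synchronized void onClose(Runnable listener) {
            if (closed) {
                listener.run();            // already closed: notify immediately
            } else {
                closeListeners.add(listener);
            }
        }

        /** Closes the channel once and notifies listeners outside the lock. */
        void close() {
            List<Runnable> toNotify;
            synchronized (this) {
                if (closed) return;
                closed = true;
                toNotify = new ArrayList<>(closeListeners);
                closeListeners.clear();
            }
            toNotify.forEach(Runnable::run);
        }

        synchronized boolean isClosed() { return closed; }

        @Override public String toString() { return name + (isClosed() ? " [closed]" : " [open]"); }
    }

    /** Opens a child channel whose lifetime is tied to the upstream channel. */
    static SimpleChannel openChild(SimpleChannel upstream, String name) {
        SimpleChannel child = new SimpleChannel(name);
        upstream.onClose(child::close);    // propagate the upstream close to the child
        return child;
    }

    public static void main(String[] args) {
        SimpleChannel upstream = new SimpleChannel("slave");
        SimpleChannel maven = openChild(upstream, "maven");
        System.out.println(upstream + ", " + maven);
        upstream.close();                  // e.g. the slave connection dies
        System.out.println(upstream + ", " + maven);
    }
}
```

Without the onClose registration in openChild, the child would stay open after the upstream channel closes, which is the situation the comment describes.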
[JIRA] [remoting] (JENKINS-26947) Unattended wait in the remoting code
James Nord edited a comment on JENKINS-26947 Unattended wait in the remoting code

Possibly a duplicate of JENKINS-10840.

Something strange is going on with Docker and tc. With 2 freestyle builds I see a failure and the slave is disconnected with:

java.io.IOException: remote file operation failed: /home/jenkins/data/jenkins-slave.exe at hudson.remoting.Channel@7407d0f5:docker_ssh: hudson.remoting.ChannelClosedException: channel is already closed
    at hudson.FilePath.act(FilePath.java:985)
    at hudson.FilePath.act(FilePath.java:967)
    at hudson.FilePath.exists(FilePath.java:1435)
    at org.jenkinsci.modules.windows_slave_installer.SlaveExeUpdater$1.call(SlaveExeUpdater.java:46)
    at org.jenkinsci.modules.windows_slave_installer.SlaveExeUpdater$1.call(SlaveExeUpdater.java:37)
    at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: hudson.remoting.ChannelClosedException: channel is already closed
    at hudson.remoting.Channel.send(Channel.java:549)
    at hudson.remoting.Request.call(Request.java:129)
    at hudson.remoting.Channel.call(Channel.java:751)
    at hudson.FilePath.act(FilePath.java:978)
    ... 9 more
Caused by: java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
Caused by: java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2325)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
    at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:40)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)
ERROR: Socket connection to SSH server was lost
java.io.IOException: Peer sent DISCONNECT message (reason code 2): Packet corrupt
    at com.trilead.ssh2.transport.TransportManager.receiveLoop(TransportManager.java:766)
    at com.trilead.ssh2.transport.TransportManager$1.run(TransportManager.java:489)
    at java.lang.Thread.run(Thread.java:745)

But a single-bit packet corruption should cause the packet to be thrown away at the OS layer due to a TCP checksum mismatch, and should not be seen by the application. The other interesting thing is that a build can be running fine and only dies when a new build is kicked off. I would not expect an issue in setting up a new channel in the multiplex (from what KK said) to cause the other channels to fail.
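The checksum argument in this comment can be checked with a small experiment. The following is a minimal sketch, assuming the 16-bit one's-complement checksum described in RFC 1071; it leaves out the TCP pseudo-header and any NIC checksum offloading, and the class name and sample payload are made up for the example. It flips every bit of a payload in turn and counts how many flips change the checksum.

```java
import java.nio.charset.StandardCharsets;

public class InternetChecksum {
    /** 16-bit one's-complement checksum in the style of RFC 1071. */
    static int checksum(byte[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i += 2) {
            int hi = data[i] & 0xFF;
            int lo = (i + 1 < data.length) ? data[i + 1] & 0xFF : 0; // pad odd length with zero
            sum += (hi << 8) | lo;
        }
        while ((sum >> 16) != 0) {
            sum = (sum & 0xFFFF) + (sum >> 16); // fold carries back into the low 16 bits
        }
        return (int) (~sum & 0xFFFF);
    }

    /** Returns a copy of data with exactly one bit inverted. */
    static byte[] flipBit(byte[] data, int bitIndex) {
        byte[] copy = data.clone();
        copy[bitIndex / 8] ^= (byte) (1 << (bitIndex % 8));
        return copy;
    }

    /** Counts the single-bit flips that the checksum notices. */
    static int detectedFlips(byte[] data) {
        int original = checksum(data);
        int detected = 0;
        for (int bit = 0; bit < data.length * 8; bit++) {
            if (checksum(flipBit(data, bit)) != original) {
                detected++;
            }
        }
        return detected;
    }

    public static void main(String[] args) {
        // Sample payload is arbitrary; any byte sequence works.
        byte[] payload = "remoting payload".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detectedFlips(payload) + " of " + payload.length * 8
                + " single-bit flips change the checksum");
    }
}
```

This only illustrates the single-bit case raised in the comment; it says nothing about multi-bit corruption or about where on the path the corruption is injected.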
[JIRA] [remoting] (JENKINS-26947) Unattended wait in the remoting code
James Nord commented on JENKINS-26947 Unattended wait in the remoting code Have you disabled the PingThread at all? AFAICT netem does not kill the connection - the remote end will be retransmitting the packets - and as such the channel is not closed. The PingThread should eventually notice this (10-minute interval + 4-minute timeout), so after at most 14 minutes the connection should be killed and this thread should unblock.
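The 14-minute figure above is simple arithmetic, sketched here for reference. The 10-minute interval and 4-minute timeout are taken from the comment, not read from any Jenkins configuration, and the 30-minute value is the thread dump timing reported elsewhere in this thread; the class and method names are hypothetical.

```java
import java.time.Duration;

public class PingDetectionWindow {
    /**
     * Worst case: the link dies right after a successful ping, so a full
     * interval passes before the next ping starts, and that ping must time out.
     */
    static Duration worstCase(Duration interval, Duration timeout) {
        return interval.plus(timeout);
    }

    /** True if more time has elapsed than the worst-case detection window. */
    static boolean shouldHaveFired(Duration elapsed, Duration interval, Duration timeout) {
        return elapsed.compareTo(worstCase(interval, timeout)) > 0;
    }

    public static void main(String[] args) {
        Duration interval = Duration.ofMinutes(10); // value quoted in the comment
        Duration timeout = Duration.ofMinutes(4);   // value quoted in the comment
        Duration elapsed = Duration.ofMinutes(30);  // thread dump timing reported in the thread

        System.out.println("worst-case detection: " + worstCase(interval, timeout).toMinutes() + " min");
        System.out.println("ping should have fired by now: " + shouldHaveFired(elapsed, interval, timeout));
    }
}
```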
[JIRA] [remoting] (JENKINS-26947) Unattended wait in the remoting code
Yoann Dubreuil commented on JENKINS-26947 Unattended wait in the remoting code No, I did not disable the ping thread. In fact, I did nothing special: I just started a fresh Jenkins instance and connected it to this Docker slave. I took the thread dump 30 minutes after the disconnection. I will relaunch the test this afternoon to see whether it ever times out.
[JIRA] [remoting] (JENKINS-26947) Unattended wait in the remoting code
Yoann Dubreuil updated JENKINS-26947 Unattended wait in the remoting code
Change By: Yoann Dubreuil (24/Feb/15 10:32 PM)
Description:
I found a way to trigger a remoting problem using TCP fault injection with netem. I'm able to trigger this wait call at hudson.remoting.Request.call(Request.java:146):
{code}
while(response==null && !channel.isInClosed())
    // I don't know exactly when this can happen, as pendingCalls are cleaned up by Channel,
    // but in production I've observed that in rare occasion it can block forever, even after a channel
    // is gone. So be defensive against that.
    wait(30*1000);
{code}
When this wait is triggered, the running build is stuck and consumes an executor. It loops over and over on the wait.
To reproduce, set up an SSH slave using the attached Dockerfile, and set up netem on the docker0 bridge like this:
{code}
tc qdisc add dev docker0 root netem
tc qdisc change dev docker0 root netem corrupt 1
{code}
Run the job once before configuring netem: netem settings apply to all network streams, so the build could otherwise fail while downloading Maven dependencies. I just launched a Maven build of an example project to trigger the problem. It might be a Maven-specific problem...
To remove the netem settings, just run tc qdisc del dev docker0 root.
I've attached the Dockerfile, the command I used to launch it, and a thread dump of a stuck Jenkins master.
[JIRA] [remoting] (JENKINS-26947) Unattended wait in the remoting code
Daniel Beck commented on JENKINS-26947 Unattended wait in the remoting code Is this a security issue? E.g. is this exploitable by third parties to disrupt a network-reachable Jenkins service?
[JIRA] [remoting] (JENKINS-26947) Unattended wait in the remoting code
Yoann Dubreuil commented on JENKINS-26947 Unattended wait in the remoting code You must be on the path of the network stream to be able to change the packet content. Even if you are able to get there, the SSH protocol protects the content of the stream. You would only be able to trigger a disconnection, but at this stage, I bet a lot of other network services are in danger. So for me, it's not a security issue.
[JIRA] [remoting] (JENKINS-26947) Unattended wait in the remoting code
Yoann Dubreuil created JENKINS-26947 Unattended wait in the remoting code
Issue Type: Bug
Assignee: Unassigned
Attachments: Dockerfile, launch.sh, stacktrace.txt
Components: remoting
Created: 12/Feb/15 10:36 PM
Description:
I found a way to trigger a remoting problem using TCP fault injection with netem. I'm able to trigger this wait call at hudson.remoting.Request.call(Request.java:146):
{{
while(response==null && !channel.isInClosed())
    // I don't know exactly when this can happen, as pendingCalls are cleaned up by Channel,
    // but in production I've observed that in rare occasion it can block forever, even after a channel
    // is gone. So be defensive against that.
    wait(30*1000);
}}
When this wait is triggered, the running build is stuck and consumes an executor. It loops over and over on the wait.
To reproduce, set up an SSH slave using the attached Dockerfile, and set up netem on the docker0 bridge like this:
tc qdisc add dev docker0 root netem
tc qdisc change dev docker0 root netem corrupt 1
Run the job once before configuring netem: netem settings apply to all network streams, so the build could otherwise fail while downloading Maven dependencies. I just launched a Maven build of an example project to trigger the problem. It might be a Maven-specific problem...
To remove the netem settings, just run tc qdisc del dev docker0 root.
I've attached the Dockerfile, the command I used to launch it, and a thread dump of a stuck Jenkins master.
Environment: Linux
Project: Jenkins
Priority: Minor
Reporter: Yoann Dubreuil
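The wait loop quoted in the description can be reproduced in isolation. This is a minimal sketch, not the hudson.remoting.Request code: PendingRequest, complete, and markClosed are hypothetical names, and the poll period is shortened from 30 seconds so the example finishes quickly. It shows that the timed wait only wakes the caller up to re-check the condition; if no response arrives and the channel is never marked closed, the caller keeps looping and stays blocked.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a pending remoting request; NOT hudson.remoting.Request. */
public class PendingRequest {
    private Object response;        // set when the remote side answers
    private boolean channelClosed;  // set when the channel is known to be dead
    private int wakeUps;            // how many times the loop re-checked its condition

    /** Blocks until a response arrives or the channel is marked closed. */
    public synchronized Object call(long pollMillis) throws InterruptedException {
        while (response == null && !channelClosed) {
            wait(pollMillis);       // timed wait: re-check even if nobody calls notify
            wakeUps++;
        }
        if (response == null) {
            throw new IllegalStateException("channel closed before a response arrived");
        }
        return response;
    }

    public synchronized void complete(Object value) { response = value; notifyAll(); }

    public synchronized void markClosed() { channelClosed = true; notifyAll(); }

    public synchronized int wakeUps() { return wakeUps; }

    public static void main(String[] args) throws Exception {
        PendingRequest stuck = new PendingRequest();
        Thread caller = new Thread(() -> {
            try {
                stuck.call(50);
            } catch (IllegalStateException | InterruptedException e) {
                System.out.println("caller released: " + e.getMessage());
            }
        });
        caller.start();

        // Nobody answers and nobody marks the channel closed: the caller just loops.
        TimeUnit.MILLISECONDS.sleep(300);
        System.out.println("still blocked: " + caller.isAlive() + ", wake-ups so far: " + stuck.wakeUps());

        // Only marking the channel closed releases the caller.
        stuck.markClosed();
        caller.join(1000);
        System.out.println("blocked after close: " + caller.isAlive());
    }
}
```

In this sketch the blocked caller corresponds to the stuck build holding an executor: the loop itself is harmless, but it depends on something else setting the closed flag.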