[JIRA] (JENKINS-50405) runATH leads to deadlock of resource consumption for core PR builds
Raul Arabaolaza updated JENKINS-50405
Change By: Raul Arabaolaza
Status: In Review -> Resolved
Resolution: Fixed

This message was sent by Atlassian JIRA (v7.3.0#73011-sha1:3c73d0e) -- You received this message because you are subscribed to the Google Groups "Jenkins Issues" group. To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-issues+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.
[JIRA] (JENKINS-50405) runATH leads to deadlock of resource consumption for core PR builds
R. Tyler Croy commented on JENKINS-50405:

I believe this is safe to close up now.
[JIRA] (JENKINS-50405) runATH leads to deadlock of resource consumption for core PR builds
Raul Arabaolaza commented on JENKINS-50405:

The PR is merged and the ATH is working again, see here. Also, according to the logs, only three nodes are used (as expected): one for linux, another for windows, and the last one for the ATH, so it seems the contention issue is also fixed. I am going to keep this in review for two or three days just in case, and close it if no problems arise.
[JIRA] (JENKINS-50405) runATH leads to deadlock of resource consumption for core PR builds
Raul Arabaolaza commented on JENKINS-50405:

PR#34 should fix the issue. Basically, what I have done is change ensureInNode so it takes a comma-separated list of labels and checks that each of those labels is present on the current node; if not, it allocates a new node with the labels joined by "&&".

Caveat: it is still unable to deal with complex label expressions, but that functionality is not needed to run on the current infra, as you can simply enclose the entire runATH call in a node("docker && highmem") block, as the Jenkinsfile for core does. With the changes in #34, no node allocation will be done at all in that case and no node will be blocked waiting for another one.
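The check described in this comment can be sketched roughly as follows. This is a standalone Python model of the logic, not the Groovy code from PR#34; the function name `node_request` and its return convention are hypothetical, chosen only for illustration.

```python
from typing import Optional


def node_request(required_csv: str, node_labels: str) -> Optional[str]:
    """Return None if the current node already has every required label,
    otherwise the label expression to allocate a new node with.

    required_csv: comma-separated label atoms, e.g. "docker,highmem"
    node_labels:  space-delimited label atoms of the current node,
                  in the style of the NODE_LABELS environment variable
    """
    required = [label.strip() for label in required_csv.split(",") if label.strip()]
    present = set(node_labels.split())
    if all(label in present for label in required):
        return None  # already on a suitable node, no new allocation
    return "&&".join(required)  # labels joined by "&&", as described above


print(node_request("docker,highmem", "docker highmem linux"))  # None
print(node_request("docker,highmem", "linux"))                 # docker&&highmem
```

Because each label is compared as a whole atom, a call wrapped in a node that already carries both labels needs no nested allocation, which is the case the comment describes for the core Jenkinsfile.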
[JIRA] (JENKINS-50405) runATH leads to deadlock of resource consumption for core PR builds
Raul Arabaolaza commented on JENKINS-50405:

In the meantime, I can just make ensureInNode accept a list of labels and check that all of them are present on the current node.

Andrew Bayer, do you believe an ensureInNode step able to deal with label expressions could be an interesting addition to the workflow.durable-task-step plugin?
[JIRA] (JENKINS-50405) runATH leads to deadlock of resource consumption for core PR builds
Andrew Bayer commented on JENKINS-50405:

So it appears that ensureInNode in runATH.groovy doesn't handle label expressions (e.g., docker && highmem), just label atoms (e.g., highmem), since it's looking for the literal string docker && highmem in the NODE_LABELS environment variable, which is just a space-delimited list of the individual label atoms on the node. So, for example, if the only two labels on the node are docker and highmem, then NODE_LABELS is docker highmem, which obviously doesn't contain docker && highmem.

This doesn't create a deadlock per se, but it does double up the executor usage per run, with one executor nested within another.

I'm not sure what the solution is, exactly. In this particular case it's actually pretty simple: just switch to highmem alone, since that'll get you the same thing as docker && highmem. But some smarter logic for determining what node you're on and what node you want to be on would be handy. However, that probably involves diving into the core label logic to do parsing, comparing, etc., and that is not shared library material.
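The mismatch described in this comment is easy to reproduce in isolation. The snippet below is a small Python model, not the actual runATH.groovy code; `naive_in_node` is a hypothetical name standing in for a literal substring check against NODE_LABELS.

```python
def naive_in_node(wanted: str, node_labels: str) -> bool:
    """Literal substring check, modelling the behaviour described above."""
    return wanted in node_labels


# NODE_LABELS for a node whose only labels are docker and highmem
node_labels = "docker highmem"

print(naive_in_node("highmem", node_labels))            # True: a label atom matches
print(naive_in_node("docker && highmem", node_labels))  # False: the expression never appears literally
```

The second call returns False even though the node satisfies the expression, so under this model a new node would be requested while the current executor stays occupied.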
[JIRA] (JENKINS-50405) runATH leads to deadlock of resource consumption for core PR builds
SCM/JIRA link daemon commented on JENKINS-50405:

Code changed in jenkins
User: R. Tyler Croy
Path: Jenkinsfile
http://jenkins-ci.org/commit/jenkins/0ca03d89c7e3a2b7855965e79b84dac2c0052119
Log: Merge pull request #3371 from raul-arabaolaza/JENKINS-50405-Quick_fix
JENKINS-50405 Run the entire thing in docker && highmem node
Compare: https://github.com/jenkinsci/jenkins/compare/9f599911f612...0ca03d89c7e3
[JIRA] (JENKINS-50405) runATH leads to deadlock of resource consumption for core PR builds
SCM/JIRA link daemon commented on JENKINS-50405:

Code changed in jenkins
User: Raul Arabaolaza
Path: Jenkinsfile
http://jenkins-ci.org/commit/jenkins/9f8b5d691e3d11d65625497a1b876e1d47c466d0
Log: JENKINS-50405 Run the entire thing in docker && highmem node
[JIRA] (JENKINS-50405) runATH leads to deadlock of resource consumption for core PR builds
Raul Arabaolaza started work on JENKINS-50405
Change By: Raul Arabaolaza
Status: Open -> In Progress
[JIRA] (JENKINS-50405) runATH leads to deadlock of resource consumption for core PR builds
Raul Arabaolaza updated JENKINS-50405
Change By: Raul Arabaolaza
Status: In Progress -> In Review
[JIRA] (JENKINS-50405) runATH leads to deadlock of resource consumption for core PR builds
R. Tyler Croy updated JENKINS-50405
Change By: R. Tyler Croy
Component/s: essentials
Sprint: Essentials - Milestone 1
[JIRA] (JENKINS-50405) runATH leads to deadlock of resource consumption for core PR builds
Raul Arabaolaza commented on JENKINS-50405:

So, after a talk with R. Tyler Croy, we are not going to disable this yet. I am going to start thinking about a better way to orchestrate nodes so I can minimize resource contention. It seems it has been running properly all week, and the problem was triggered by a bunch of very quick merges into core.
[JIRA] (JENKINS-50405) runATH leads to deadlock of resource consumption for core PR builds
Raul Arabaolaza commented on JENKINS-50405:

So, as a quick fix, I am going to create a PR to simply not run the ATH for the moment. As a better solution, I have to find a way to free the "linux" node while it is waiting for the docker ones, if possible, and if not, just make sure the full runATH is executed on "docker && highmem".
[JIRA] (JENKINS-50405) runATH leads to deadlock of resource consumption for core PR builds
R. Tyler Croy created JENKINS-50405
Issue Type: Bug
Assignee: Raul Arabaolaza
Components: acceptance-test-harness
Created: 2018-03-26 14:28
Priority: Major
Reporter: R. Tyler Croy

This weekend we experienced a denial-of-service on ci.jenkins.io due to resource contention caused by the runATH step in the core Jenkinsfile.

Basically, an executor on the "linux" label was occupied while blocking and waiting for an executor on "docker && highmem". When Jenkins can't provision "highmem" due to capacity issues, the runATH step blocks the "linux" executor indefinitely.

At the bottom of the Jenkinsfile for core is some code along these lines:

    node('linux') {
        /* some setup */
        runATH()
    }

In runATH(), the first ensureInNode statement ensures that the Pipeline only uses one node, since the execution is already in a node whose NODE_LABELS include "linux". When the second ensureInNode executes, it attempts to ensure that the execution is in "docker && highmem", which it is of course not. This causes Pipeline to block waiting for that node while still occupying the outer "linux" node declaration.

This is kind of a big problem and will cause additional resource contention whenever more than one or two core PRs are merged in quick succession.
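To make the contention pattern concrete, here is a deliberately simplified Python model of it. It is not how Jenkins schedules executors; the function `simulate` and the pool sizes are hypothetical and only illustrate builds that hold a "linux" executor while requesting a "highmem" one.

```python
def simulate(linux_executors: int, highmem_executors: int, builds: int) -> dict:
    """Snapshot of executor usage when every build first takes a linux
    executor and then, while still holding it, requests a highmem one."""
    free_linux = linux_executors
    free_highmem = highmem_executors
    running = 0   # builds holding both a linux and a highmem executor
    blocked = 0   # builds holding a linux executor, waiting for highmem
    queued = 0    # builds that could not even get a linux executor

    for _ in range(builds):
        if free_linux == 0:
            queued += 1
            continue
        free_linux -= 1          # outer node('linux') allocation
        if free_highmem > 0:
            free_highmem -= 1    # nested allocation succeeds
            running += 1
        else:
            blocked += 1         # linux executor stays occupied while waiting

    return {"running": running, "blocked": blocked,
            "queued": queued, "free_linux": free_linux}


# No highmem capacity: every linux executor ends up held by a waiting build.
print(simulate(linux_executors=2, highmem_executors=0, builds=3))
# {'running': 0, 'blocked': 2, 'queued': 1, 'free_linux': 0}
```

In this model a few builds arriving together are enough to exhaust the linux pool whenever highmem capacity is unavailable, which matches the situation reported above.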