On 5/16/21 10:19 PM, squ...@treenet.co.nz wrote:
> On 2021-05-17 11:56, Alex Rousskov wrote:
>> On 5/16/21 3:31 AM, Amos Jeffries wrote:
>>> On 4/05/21 2:29 am, Alex Rousskov wrote:
>>>> The principles I have proposed allow upgrades that do not violate key
>>>> invariants. For example, if a proposed upgrade would break master, then
>>>> master has to be changed _before_ that upgrade actually happens, not
>>>> after. Upgrades must not break master.
>>>
>>> So ... a node is added/upgraded. It runs and builds master fine. Then
>>> added to the matrices some of the PRs start failing.
>>
>> It is easy to misunderstand what is going on because there is no good
>> visualization of complex PR-master-Jenkins_nodes-Jenkins_failures
>> relationships. Several kinds of PR test failures are possible. I will
>> describe the two most relevant to your email:
>>
>> * PR test failures due to problems introduced by PRs should be welcomed
>> at any time.
>
> Strawman here. This is both general statement and not relevant to CI
> changes or design(s) we are discussing.

The first bullet (out of the two bullets that are meant to be interpreted
together, in their context) is not even close to being a strawman argument
by any definition I can find. Neither is the combination of those two
bullets.

>> CI improvements are allowed to find new bugs in open PRs.

> IMO the crux is that word "new". CI improvements very rarely find new
> bugs. What it actually finds and intentionally so is *existing bugs* the
> old CI config wrongly ignored.

Here, I used "new" to mean bugs that had not been found by the previous
CI version (i.e. "newly discovered" or "new to us"). These bugs existed
before and after CI improvements, of course -- neither master nor the PR
has changed -- but we did not know about them.

>> * PR test failures due to the existing master code are not welcomed.

> That is not as black/white as the statement above implies. There are
> some master branch bugs we don't want to block PRs merging, and there
> are some (rarely) we absolutely do not want any PRs to change master
> until fixed.

Yes, master bugs should not affect PR merging in the vast majority of
cases, but that is not what this bullet is about at all! This bullet is
about (a certain kind of) PR test failures. Hopefully, we do not need to
revisit the debate about whether PRs with failed tests should be merged.
They should not be merged, which is exactly why PR test failures caused
by the combination of CI changes and the existing master code are not
welcomed -- they block the progress of an innocent PR.

>> They represent a CI failure.
>
> IMO this is absolutely false. The whole point of improving CI is to find
> those "existing" bugs which the previous CI config wrong missed.

Your second sentence is correct, but it does not make my statement
false. CI should achieve several goals. Finding bugs (i.e. blocking
buggy PRs) is one of them. Merging (i.e. not blocking) PRs that should
be merged is another. And there are a few more goals, of course. The
problem described in my second bullet represents a CI failure to reach
one of the key CI goals or a failure to maintain a critical CI
invariant. It is a CI failure rather than a PR failure (the latter is
covered by the first bullet).

> e.g. v4+ currently do not build on Windows. We know this, but the
> current CI testing does not show it. Upgrading the CI to include a test
> for Windows is not a "CI failure".

If such an upgrade would result in blocking PRs that do not touch
Windows code, then that upgrade would be a CI failure. Or a failure to
properly upgrade CI.

>> In these cases, if the latest master code
>> is tested with the same test after the problematic CI change, then that
>> master test will fail. Nothing a PR can do in this situation can fix
>> this kind of failure because it is not PR changes that are causing the
>> failure -- CI changes broke the master branch,
>
> Ah. "broke the master branch" is a bit excessive. master is not broken
> any more or less than it already was.

If you can suggest a better short phrase to describe the problem, please
do so! Until then, I will have to continue to use "breaking master" (in
this context) simply for the lack of a better phrase. I tried to explain
what that phrase means to avoid misunderstanding. I cannot find a better
way to compress that explanation into a single phrase.

> What is *actually* broken is the CI test results.

One can easily argue that CI test results are actually "correct" in this
case -- they correctly discover a bug in the code that wants to become
the next master.

The problem is in the assignment of responsibility: The PR did not
introduce that bug, so the PR should not be punished for that bug. The
only way to avoid such punishment (given the natural automated CI
limitations) is to avoid breaking master tests, as previously defined.

> Otherwise, short periods between sysadmin thinking it was a safe change
> and reverting as breakage appeared is to be expected.

Well, it is debatable whether frequent breakages should be _expected_ --
there are certainly ways to avoid the vast majority of them, but I have
already agreed that we can survive breaking upgrade attempts, even
relatively frequent ones, provided the admin doing the upgrade monitors
CI and can quickly undo the attempts that break master, as previously
defined.

> I see here two distros which have "rolling release" being updated by
> sysadmin from producing outdated and wrong test results, to producing
> correct test results. This is a correct change in line with the goal of
> our nodes representing what a user running that OS would see building
> Squid master or PRs.

In general, the CI change is incorrect if it results in breaking master,
as previously defined. This particular correctness aspect does not
depend on what tests the CI change was meant to fix.

> IMO we can expect to occur on a regular basis

I hope that Francesco will find a way to avoid creating this persistent
expectation of CI-caused problems!

> We can resolve it by having those OS only
> build in the N-matrix applied before releases, instead of the matrix
> blocking PR tests or merging.

AFAICT, that would not really _resolve_ the problem, only delay its
manifestation until release time (when the stress related to fixing it
will be extreme, and the maintainer will be tempted to abuse the
pressure to release to push low-quality changes through the review
process).

> If we are all agreed, kinkie or I can implement ASAP.

I would not strongly object to delaying N-matrix tests (whatever they
will be) until release time, but delaying the pain until release time
sounds like a poor solution to me.

I think the best solution here would heavily depend on who is
responsible for adjusting the official code (to allow the anticipated CI
upgrade). Their resources/preferences/capabilities/etc. may determine
the best solution, the set of required tests, and the corresponding
rules. Today, nobody specific is responsible, and many volunteer
developers are often unaware of the failures or cannot quickly address
them. With some luck, the changes that Francesco has started to propose
will improve this situation.

One solution that would be better, IMO, than delaying N-matrix tests
would be to make tests on "rolling" OSes optional instead of required
(but still run them on all PRs). Splitting merge tests into optional and
required would attract developer attention without blocking innocent
PRs.

>>> B. PR submission testing
>>> - which OS for master (5-pr-test) ?
>>> - which OS for beta (5-pr-test) ?
>>> - which OS for stable (5-pr-test) ?
>>>
>>> Are all of those sets the same identical OS+compilers? no.
>>> Why are they forced to be the same matrix test?
>>
>> I do not understand the question. Are you asking why Jenkins uses the
>> same 5-pr-test configuration for all three branches (master, beta, _and_
>> stable)? I do not know the answer.

> So can we agree that they should be different tests?

I, personally, cannot answer that specific question in a meaningful way,
but I doubt I would strongly object to virtually any Jenkins changes
related to stable and beta test sets. I continue to view those branches
(but not their branching points!) as primarily the maintainer's
responsibility. If others do not object, you should feel free to make
them different IMO. Just keep GitHub/Jenkins/Anubis integration in mind
when you do so (e.g., by keeping the _number_ of GitHub-visible tests or
"status checks" the same across all branches).

>>> C. merge testing
>>> - which OS for master (5-pr-auto) ?
>>> - which OS for beta (5-pr-auto) ?
>>> - which OS for stable (5-pr-auto) ?
>>> NP: maintainer does manual override on beta/stable merges.
>>>
>>> Are all of those sets the same identical OS+compilers? no.
>>> Why are they forced to be the same matrix test? Anubis
>>
>> This is too cryptic for me to understand, but Anubis does not force any
>> tests on anybody -- it simply checks that the required tests have
>> passed. I am not aware of any Anubis bugs in this area, but please
>> correct me if I am wrong.

> My understanding was that Anubis only has ability to check PRs against
> its auto branch which tracks master.

Anubis documentation[1] defines staging_branch (i.e. our "auto") as a
branch for testing PR changes as if they were merged into the PR target
branch. The target branch is defined by each GitHub PR. AFAICT, Anubis
does not really have a high-level concept of a "master" branch. The auto
branch should contain whatever target branch the PR is using.

[1] https://github.com/measurement-factory/anubis/blob/master/README.md

> Ability to have it track other
> non-master branches and merge there is not available for use.

AFAICT, Anubis is tracking v4 PRs but you are ignoring it. For example,
merged PR #815 has a "waiting for more votes" status check added by
Anubis to the last commit (f6828ed):
https://github.com/squid-cache/squid/pull/815

There may be some integration problems that I am not aware of, but I
think everything should work in principle. AFAICT, you are forcing
manual v4/v5 commits instead of following the procedure we use for
master, but that is your decision. Please note that I am not saying that
your decision is right or wrong, just documenting my understanding and
observations.

> IMO we should look into this. But it is a technical project for sysadmin
> + Eduard to coordinate. Not a policy thing.

Which tests are blocking and which are optional is a policy thing.
Whether it is OK to break master, as previously defined, is a policy
thing. How to implement those and other policies is usually a
technicality indeed.

>> AFAICT, Francesco and I are on the same page regarding not breaking
>> master anymore -- he graciously agreed to prevent such breakages in the
>> future, and I am very thankful that he did. Based on your comments
>> discussing several cases where such master breakage is, in your opinion,
>> OK, you currently disagree with that principle. I do not know why.

> I think we differ in our definitions of "breaking master". You seem to
> be including breakage of things in the CI system itself which I consider
> outside of "master", or expected results of normal sysadmin activity.

I hope my definition of "breaking master" in this CI change context is
clear by now. I still do not know why you object to prohibiting it,
under any name.

> I hope my response to the two use-cases you present at the top of this
> email clarify.

Unfortunately, those attacks on my choice of words did not help me
understand why you are opposed to the principles those words describe.
You are welcome to propose better phrasing, of course, but that would
not change what actually happened or my desire to prohibit those kinds
of events in the future.


Alex.