On 4/05/21 2:29 am, Alex Rousskov wrote:
On 5/3/21 12:41 AM, Francesco Chemolli wrote:
- we want our QA environment to match what users will use. For this
reason, it is not sensible that we just stop upgrading our QA nodes,

I see flaws in reasoning, but I do agree with the conclusion -- yes, we
should upgrade QA nodes. Nobody has proposed a ban on upgrades AFAICT!

The principles I have proposed allow upgrades that do not violate key
invariants. For example, if a proposed upgrade would break master, then
master has to be changed _before_ that upgrade actually happens, not
after. Upgrades must not break master.

So ... a node is added/upgraded. It runs and builds master fine. Then, once added to the matrices, some of the PRs start failing.

*THAT* is the situation I see happening recently. Master itself is working fine, yet "huge amounts of pain, the sky is falling" complaints come from a couple of people.

The sky is not falling. Master is no more nor less broken and buggy than it was before the sysadmin touched Jenkins.

The PR itself is no more, nor less, "broken" than it would be if, for example, it was only tested on Linux nodes and failed to compile on Windows. As happens to be the case for master *right now*.



What this means in terms of sysadmin steps for doing upgrades is up to
you. You are doing the hard work here, so you can optimize it the way
that works best for _you_. If really necessary, I would not even object
to trial upgrades (that may break master for an hour or two) as long as
you monitor the results and undo the breaking changes quickly and
proactively (without relying on my pleas to fix Jenkins to detect
breakages). I do not know what is feasible and what the best options
are, but, again, it is up to _you_ how to optimize this (while observing
the invariants).


Uhm. Respectfully, from my perspective the above paragraph conflicts directly with the actions taken.

From what I can tell, kinkie (as sysadmin) *has* been making a new node and testing it first, not just against master but against the main branches and the most active PRs, before adding it to the *post-merge* matrices used for snapshot production.

  But still, threads like this one, full of complaints, appear.



I understand there is some specific pain you have encountered to trigger the complaint. Can we get down to documenting as exactly as possible what the particular pain was?

Much of the process we are discussing is scripted automation, not human processing, so mistakes there are bugs. Handling such pain points as bugs filed under the Bugzilla "Project" section would be best. Re-designing the entire system policy just moves us all to another set of unknown bugs when the scripts are re-coded to meet that policy.



- I believe we should define four tiers of runtime environments, and
reflect these in our test setup:

  1. current and stable (e.g. ubuntu-latest-lts)
  2. current (e.g. fedora 34)
  3. bleeding edge
  4. everything else, including FreeBSD and OpenBSD

I doubt this classification is important to anybody _outside_ this
discussion, so I am OK with whatever classification you propose to
satisfy your internal needs.


IIRC this is the 5th iteration of ground-up redesign for this wheel.

Test designs that do not fit into our merge and release process sequence have proven, time and again, to be broken and painful to Alex when they operate as designed. For the rest of us, it is the constant rebuilding of automation which is the painful part.


A. dev pre-PR testing
   - random individual OS.
   - matrix of everything (anybranch-*-matrix)

B. PR submission testing
   - which OS for master (5-pr-test) ?
   - which OS for beta (5-pr-test) ?
   - which OS for stable (5-pr-test) ?

Are all of those sets the same identical OS+compiler combinations? No.
Why are they forced to be the same matrix test?
  IIRC, a policy forced on the sysadmin by previous pain complaints.

Are we getting painful experiences from this?
Yes. The lack of branch-specific testing before stage D on beta and stable causes those branches to break far more often at the last minute before releases than master does, adding random days or weeks to each scheduled release.


C. merge testing
   - which OS for master (5-pr-auto) ?
   - which OS for beta (5-pr-auto) ?
   - which OS for stable (5-pr-auto) ?
     NP: maintainer does manual override on beta/stable merges.

Are all of those sets the same identical OS+compiler combinations? No.
  Why are they forced to be the same matrix test? Anubis.

Are we getting painful experiences from this? Yes; see (B).


D. pre-release testing (snapshots + formal)
   - which OS for master (trunk-matrix) ?
   - which OS for beta (5-matrix) ?
   - which OS for stable (4-matrix) ?

Are all of those sets the same identical OS+compiler combinations? No.
Are we forcing them to use the same matrix test? No.
Are we getting painful experiences from this? Maybe.
Most of the loud complaints have been about "breaking master", which is the most volatile branch being tested on the most volatile OSes.



FTR: the reason all those matrices have the '5-' prefix is that, several redesigns ago, master/trunk had a matrix to which the sysadmin added nodes as OSes were upgraded. When branching vN, the maintainer would clone/freeze that matrix into an N-foo matrix, used to test the code against the OS+compiler combinations the vN branch was designed to build on.
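For illustration, that clone/freeze step could be sketched like this (a minimal sketch only; the node names and data layout are assumptions, not the real Jenkins configuration):

```python
def freeze_matrix(trunk_nodes, version):
    """Snapshot the rolling trunk matrix as a version-pinned vN matrix.

    The trunk matrix keeps evolving as the sysadmin adds/upgrades nodes;
    the frozen copy stays fixed for the lifetime of the vN branch.
    """
    return {
        "name": f"{version}-matrix",   # e.g. "5-matrix"
        "nodes": list(trunk_nodes),    # frozen copy, detached from trunk
    }

# Hypothetical node names for illustration:
trunk = ["ubuntu-20.04-gcc", "fedora-34-clang", "freebsd-12-clang"]
v5 = freeze_matrix(trunk, 5)
print(v5["name"])  # 5-matrix
```

Later trunk changes (new distro versions, dropped nodes) then never disturb the frozen release matrix.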


Can we have the people claiming pain specify exactly what the pain is coming from, and let the sysadmin/developer(s) with specialized knowledge of the automation in that area decide how best to fix it?



I believe we should focus on the first two tiers for our merge workflow,
but then expect devs to fix any breakages in the third and fourth tiers
if caused by their PR,

FWIW, I do not understand what "focus" implies in this statement, and
why developers should _not_ "fix any breakages" revealed by the tests in
the first two tiers.

The rules I have in mind use two natural tiers:

* If a PR cannot pass a required CI test, that PR has to change before
it can be merged.

* If a PR cannot pass an optional CI test, it is up to PR author and
reviewers to decide what to do next.

That is already the case. Already well documented and understood.

I see no need to change anything based on those criteria. Ergo, you have some undeclared criteria leading to whatever pain triggered this discussion. Maybe the pain is some specific bug that does not need a whole discussion and re-design by committee?



These are very simple rules that do not require developer knowledge of
any complex test node tiers that we might define/use internally.


This is the first I have heard about devs needing such knowledge. Maybe that is because these rules are already *how we do things*. A red-herring argument?


Needless to say, the rules assume that the tests themselves are correct.
If not, the broken tests need to be fixed (by the Squid Project) before
the first bullet/rule above can be meaningfully applied (the second one
is flexible enough to allow PR author and reviewers to ignore optional
test failures).


There is a hidden assumption here too: that the test is being applied correctly.

I posit that is the real bug we need to sort out. We could keep on "correcting" the node sets (aka tests) back and forth between being suitable for master and suitable for release branches. That just shuffles the pain from one end of the system to the other.

Make Anubis and Jenkins use a different matrix for each branch at process stages B and C above. Only then will discussion of what nodes to add to which test/matrix actually make progress.
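To make the proposal concrete, here is a minimal sketch of per-branch matrix selection for stages B and C. The trunk-pr-test / trunk-pr-auto job names for master are hypothetical (today all branches share the 5-pr-* jobs); only 5-pr-test and 5-pr-auto are names currently in use:

```python
# Hypothetical mapping: branch -> (PR-submission job, merge job).
BRANCH_MATRICES = {
    "master": ("trunk-pr-test", "trunk-pr-auto"),  # hypothetical names
    "v5":     ("5-pr-test", "5-pr-auto"),
    "v4":     ("4-pr-test", "4-pr-auto"),          # hypothetical names
}

def jobs_for(branch):
    """Return the (stage B, stage C) Jenkins jobs for a branch."""
    try:
        return BRANCH_MATRICES[branch]
    except KeyError:
        raise ValueError(f"no matrix defined for branch {branch!r}")

print(jobs_for("master"))  # ('trunk-pr-test', 'trunk-pr-auto')
```

The point of the design is simply that the lookup key is the target branch, so adding a node to a release matrix can never change what master PRs are tested against, and vice versa.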




Breakages due to changes in nodes (e.g. introducing a new distro
version) would be on me and would not stop the merge workflow.

What you do internally to _avoid_ breakage is up to you, but the primary
goal is to _prevent_ CI breakage (rather than to keep CI nodes "up to
date"!).

The principle ("invariant" in Alex's terminology?) behind the nodes is that they represent the OS environment a typical developer on that OS version + compiler combination can be assumed to be running.

Distros release security updates for their "stable" versions. Therefore, to stay true to that goal, we require constant small upgrades as an ongoing part of sysadmin maintenance.

Adding new nodes for the next distro release versions is a manual process, unrelated to keeping existing nodes up to date (which is automated?).

From time to time, distros break their own ability to compile things. This is to be expected on rolling-release distros and, ironically, on LTS releases (whose updates get *less* testing than those of normal releases).
It does not indicate "broken master" nor "broken CI" in any way.



There are many ways to break CI and detect those breakages, of course,
but if master cannot pass required tests after a CI change, then the
change broke CI.

I have yet to see the code in master be corrupted by CI changes in such a way that it could not build on people's development machines.

What we do have going on are network timeouts, DNS resolution failures, CPU wait timeouts, and (rarely) _automated_ CI upgrades, all causing short-term failures to pass a test.

A PR fixing the newly highlighted bugs gets around the latter. Any pain (e.g. master blocked for two days waiting on the fix PR to merge) is a normal problem with that QA process and should not be attributed to the CI change.



What I would place on each individual dev is the case where a PR breaks
something in the trunk-matrix,trunk-arm32-matrix, trunk-arm64-matrix,
trunk-openbsd-matrix, trunk-freebsd-matrix builds, even if the 5-pr-test
and 5-pr-auto builds fail to detect the breakage because it happens on an
unstable or old platform.

This feels a bit off-topic for me, but I think you are saying that
some CI tests called trunk-matrix, trunk-arm32-matrix,
trunk-arm64-matrix, trunk-openbsd-matrix, trunk-freebsd-matrix should be
classified as _required_.

That is how I read the statement too.

In other words, a PR must pass those CI tests
before it can be merged. Is that the situation today? Or are you
proposing some changes to the list of required CI tests? What are those
changes?


No, the situation today is that those matrices are new ones, only recently created by the sysadmin and not used in any of the merge or release process criteria. The BSDs, though, were once checked as part of the general 5-pr-test required for PR testing.


IMO, it's a good point. We do need to stop the practice of simply dropping support for any OS where attempting to build reveals existing bugs in master (aka "breaks master, sky falling"). More focus on fixing those bugs would increase portability and grow the Squid community beyond the subset of RHEL and Ubuntu users.


Amos
_______________________________________________
squid-dev mailing list
squid-dev@lists.squid-cache.org
http://lists.squid-cache.org/listinfo/squid-dev
