On 4/05/21 2:29 am, Alex Rousskov wrote:
On 5/3/21 12:41 AM, Francesco Chemolli wrote:
- we want our QA environment to match what users will use. For this
reason, it is not sensible that we just stop upgrading our QA nodes,

I see flaws in reasoning, but I do agree with the conclusion -- yes, we
should upgrade QA nodes. Nobody has proposed a ban on upgrades AFAICT!

The principles I have proposed allow upgrades that do not violate key
invariants. For example, if a proposed upgrade would break master, then
master has to be changed _before_ that upgrade actually happens, not
after. Upgrades must not break master.

So ... a node is added/upgraded. It runs and builds master fine. Then, once added to the matrices, some of the PRs start failing.

*THAT* is the situation I see happening recently. Master itself is working fine, yet "huge amounts of pain, the sky is falling" complaints come from a couple of people.

The sky is not falling. Master is no more nor less broken and buggy than it was before the sysadmin touched Jenkins.

The PR itself is no more, nor less, "broken" than it would be if, for example, it was only tested on Linux nodes and failed to compile on Windows. As happens to be the case for master *right now*.



What this means in terms of sysadmin steps for doing upgrades is up to
you. You are doing the hard work here, so you can optimize it the way
that works best for _you_. If really necessary, I would not even object
to trial upgrades (that may break master for an hour or two) as long as
you monitor the results and undo the breaking changes quickly and
proactively (without relying on my pleas to fix Jenkins to detect
breakages). I do not know what is feasible and what the best options
are, but, again, it is up to _you_ how to optimize this (while observing
the invariants).


Uhm. Respectfully, from my perspective the above paragraph conflicts directly with the actions taken.

From what I can tell, kinkie (as sysadmin) *has* been making a new node and testing it first, not just against master but against the main branches and the most active PRs, before adding it to the *post-merge* matrices used for snapshot production.

  But still, threads like this one, full of complaints, appear.



I understand there is some specific pain you have encountered to trigger the complaint. Can we get down to documenting as exactly as possible what the particular pain was?

Much of the process we are discussing is scripted automation, not human processing, so mistakes there are bugs. Handling such pain points as bugs filed under the Bugzilla "Project" section would be best. Re-designing the entire system policy just moves us all to another set of unknown bugs when the scripts are re-coded to meet that policy.



- I believe we should define four tiers of runtime environments, and
reflect these in our test setup:

  1. current and stable (e.g. ubuntu-latest-lts)
  2. current (e.g. fedora 34)
  3. bleeding edge
  4. everything else, including FreeBSD and OpenBSD

I doubt this classification is important to anybody _outside_ this
discussion, so I am OK with whatever classification you propose to
satisfy your internal needs.


IIRC this is the 5th iteration of ground-up redesign for this wheel.

Test designs that do not fit into our merge and release process sequence have proven, time and again, to be broken and painful to Alex when they operate as designed. For the rest of us, it is the constant rebuilding of automation which is the painful part.


A. dev pre-PR testing
   - random individual OS.
   - matrix of everything (anybranch-*-matrix)

B. PR submission testing
   - which OS for master (5-pr-test) ?
   - which OS for beta (5-pr-test) ?
   - which OS for stable (5-pr-test) ?

Are all of those sets the same identical OS+compiler combinations? No.
Why are they forced to be the same matrix test?
  IIRC, a policy forced on the sysadmin by previous pain complaints.

Are we getting painful experiences from this?
Yes. The lack of branch-specific testing before stage D on beta and stable causes those branches to break far more often at the last minute before releases than master does, adding random days or weeks to each scheduled release.


C. merge testing
   - which OS for master (5-pr-auto) ?
   - which OS for beta (5-pr-auto) ?
   - which OS for stable (5-pr-auto) ?
     NP: maintainer does manual override on beta/stable merges.

Are all of those sets the same identical OS+compiler combinations? No.
  Why are they forced to be the same matrix test? Anubis.

Are we getting painful experiences from this? Yes; see (B).


D. pre-release testing (snapshots + formal)
   - which OS for master (trunk-matrix) ?
   - which OS for beta (5-matrix) ?
   - which OS for stable (4-matrix) ?

Are all of those sets the same identical OS+compiler combinations? No.
Are we forcing them to use the same matrix test? No.
Are we getting painful experiences from this? Maybe.
Most of the loud complaints have been about "breaking master", which is the most volatile branch being tested on the most volatile OSes.



FTR: the reason all those matrices have the '5-' prefix is that, several redesigns ago, master/trunk had a matrix to which the sysadmin added nodes as OSes were upgraded. When branching vN, the maintainer would clone/freeze that matrix into an N-foo matrix, used to test the code against the OS+compiler combinations the vN branch was designed to build on.
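For illustration, that clone/freeze step could be sketched like this (a minimal sketch only; the node names and data layout are assumptions, not the real Jenkins configuration):

```python
def freeze_matrix(trunk_nodes, version):
    """Snapshot the rolling trunk matrix as a version-pinned vN matrix.

    The trunk matrix keeps evolving as the sysadmin adds/upgrades nodes;
    the frozen copy stays fixed for the lifetime of the vN branch.
    """
    return {
        "name": f"{version}-matrix",   # e.g. "5-matrix"
        "nodes": list(trunk_nodes),    # frozen copy, detached from trunk
    }

# Hypothetical node names for illustration:
trunk = ["ubuntu-20.04-gcc", "fedora-34-clang", "freebsd-12-clang"]
v5 = freeze_matrix(trunk, 5)
print(v5["name"])  # 5-matrix
```

Later trunk changes (new distro versions, dropped nodes) then never disturb the frozen release matrix.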


Can we have the people claiming pain specify exactly what the pain is coming from, and let the sysadmin/developer(s) with specialized knowledge of the automation in that area decide how best to fix it?



I believe we should focus on the first two tiers for our merge workflow,
but then expect devs to fix any breakages in the third and fourth tiers
if caused by their PR,

FWIW, I do not understand what "focus" implies in this statement, and
why developers should _not_ "fix any breakages" revealed by the tests in
the first two tiers.

The rules I have in mind use two natural tiers:

* If a PR cannot pass a required CI test, that PR has to change before
it can be merged.

* If a PR cannot pass an optional CI test, it is up to PR author and
reviewers to decide what to do next.

That is already the case. Already well documented and understood.

I see no need to change anything based on those criteria. Ergo, you have some undeclared criteria leading to whatever pain triggered this discussion. Maybe the pain is some specific bug that does not need a whole discussion and re-design by committee?



These are very simple rules that do not require developer knowledge of
any complex test node tiers that we might define/use internally.


This is the first I have heard about devs needing such knowledge. Maybe that is because these rules are already *how we do things*. A red-herring argument?


Needless to say, the rules assume that the tests themselves are correct.
If not, the broken tests need to be fixed (by the Squid Project) before
the first bullet/rule above can be meaningfully applied (the second one
is flexible enough to allow PR author and reviewers to ignore optional
test failures).


There is a hidden assumption here too: that the test is being applied correctly.

I posit that is the real bug we need to sort out. We could keep on "correcting" the node sets (aka tests) back and forth between being suitable for master and suitable for release branches. That just shuffles the pain from one end of the system to the other.

Make Anubis and Jenkins use a different matrix for each branch at process stages B and C above. Only then will discussion of what nodes to add to which test/matrix actually make progress.
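To make the proposal concrete, here is a minimal sketch of per-branch matrix selection for stages B and C. The trunk-pr-test / trunk-pr-auto job names for master are hypothetical (today all branches share the 5-pr-* jobs); only 5-pr-test and 5-pr-auto are names currently in use:

```python
# Hypothetical mapping: branch -> (PR-submission job, merge job).
BRANCH_MATRICES = {
    "master": ("trunk-pr-test", "trunk-pr-auto"),  # hypothetical names
    "v5":     ("5-pr-test", "5-pr-auto"),
    "v4":     ("4-pr-test", "4-pr-auto"),          # hypothetical names
}

def jobs_for(branch):
    """Return the (stage B, stage C) Jenkins jobs for a branch."""
    try:
        return BRANCH_MATRICES[branch]
    except KeyError:
        raise ValueError(f"no matrix defined for branch {branch!r}")

print(jobs_for("master"))  # ('trunk-pr-test', 'trunk-pr-auto')
```

The point of the design is simply that the lookup key is the target branch, so adding a node to a release matrix can never change what master PRs are tested against, and vice versa.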




Breakages due to changes in nodes (e.g. introducing a new distro
version) would be on me and would not stop the merge workflow.

What you do internally to _avoid_ breakage is up to you, but the primary
goal is to _prevent_ CI breakage (rather than to keep CI nodes "up to
date"!).

The principle ("invariant" in Alex's terminology?) behind the nodes is that they represent the OS environment a typical developer on that OS version + compiler combination can be assumed to be running.

Distros release security updates for their "stable" versions. Therefore, to stay true to that goal, we require constant small upgrades as an ongoing part of sysadmin maintenance.

Adding new nodes for the next distro release versions is a manual process, unrelated to keeping existing nodes up to date (which is automated?).

From time to time, distros break their own ability to compile things. This is to be expected on rolling-release distros and, ironically, on LTS releases (whose updates get *less* testing than those of normal releases).
It does not indicate "broken master" nor "broken CI" in any way.



There are many ways to break CI and detect those breakages, of course,
but if master cannot pass required tests after a CI change, then the
change broke CI.

I have yet to see the code in master be corrupted by CI changes in such a way that it could not build on people's development machines.

What we do have going on are network timeouts, DNS resolution failures, CPU wait timeouts, and (rarely) _automated_ CI upgrades, all causing short-term failures to pass a test.

A PR fixing the newly highlighted bugs gets around the latter. Any pain (e.g. master blocked for two days waiting on the fix PR to merge) is a normal problem with that QA process and should not be attributed to the CI change.



What I would place on each individual dev is the case where a PR breaks
something in the trunk-matrix,trunk-arm32-matrix, trunk-arm64-matrix,
trunk-openbsd-matrix, trunk-freebsd-matrix builds, even if the 5-pr-test
and 5-pr-auto builds fail to detect the breakage because it happens on an
unstable or old platform.

This feels a bit off-topic for me, but I think you are saying that
some CI tests called trunk-matrix, trunk-arm32-matrix,
trunk-arm64-matrix, trunk-openbsd-matrix, trunk-freebsd-matrix should be
classified as _required_.

That is how I read the statement too.

In other words, a PR must pass those CI tests
before it can be merged. Is that the situation today? Or are you
proposing some changes to the list of required CI tests? What are those
changes?


No, the situation today is that those matrices are new ones, only recently created by the sysadmin and not used in any of the merge or release process criteria. The BSDs, though, were once checked as part of the general 5-pr-test required for PR testing.


IMO, it's a good point. We do need to stop the practice of simply dropping support for any OS where attempting to build reveals existing bugs in master (aka "breaks master, sky falling"). More focus on fixing those bugs would increase portability and grow the Squid community beyond the subset of RHEL and Ubuntu users.


Amos
_______________________________________________
squid-dev mailing list
squid-dev@lists.squid-cache.org
http://lists.squid-cache.org/listinfo/squid-dev
