On 2021-05-17 11:56, Alex Rousskov wrote:
On 5/16/21 3:31 AM, Amos Jeffries wrote:
On 4/05/21 2:29 am, Alex Rousskov wrote:
On 5/3/21 12:41 AM, Francesco Chemolli wrote:
- we want our QA environment to match what users will use. For this
reason, it is not sensible that we just stop upgrading our QA nodes,

I see flaws in reasoning, but I do agree with the conclusion -- yes, we should upgrade QA nodes. Nobody has proposed a ban on upgrades AFAICT!

The principles I have proposed allow upgrades that do not violate key
invariants. For example, if a proposed upgrade would break master, then
master has to be changed _before_ that upgrade actually happens, not
after. Upgrades must not break master.

So ... a node is added/upgraded. It runs and builds master fine. Then,
once it is added to the matrices, some of the PRs start failing.

It is easy to misunderstand what is going on because there is no good
visualization of complex PR-master-Jenkins_nodes-Jenkins_failures
relationships. Several kinds of PR test failures are possible. I will
describe the two most relevant to your email:

* PR test failures due to problems introduced by PRs should be welcomed
at any time.

Strawman here. This is a general statement, and it is not relevant to the CI changes or design(s) we are discussing.

CI improvements are allowed to find new bugs in open PRs.

IMO the crux is that word "new". CI improvements very rarely find new bugs. What they actually find, and intentionally so, is *existing bugs* that the old CI config wrongly ignored.

Such findings, even when discovered at the "last minute", should be seen
as an overall positive event or progress -- our CI was able to identify
a problem before it got officially accepted! I do not recall anybody
complaining about such failures recently.


The conclusion being that, because "new bugs" are so rare, CI improvements very rarely draw complaints about them.


* PR test failures due to the existing master code are not welcomed.

That is not as black-and-white as the statement above implies. There are some master branch bugs that we do not want to block PR merging, and there are some (rarely) for which we absolutely do not want any PR to change master until they are fixed.

They represent a CI failure.

IMO this is absolutely false. The whole point of improving CI is to find those "existing" bugs that the previous CI config wrongly missed.

e.g. v4+ currently do not build on Windows. We know this, but the current CI testing does not show it. Upgrading the CI to include a test for Windows is not a "CI failure".


In these cases, if the latest master code
is tested with the same test after the problematic CI change, then that
master test will fail. Nothing a PR can do in this situation can fix
this kind of failure because it is not PR changes that are causing the
failure -- CI changes broke the master branch,

Ah. "broke the master branch" is a bit excessive. master is not broken any more or less than it already was.

What is *actually* broken are the CI test results.


not just the PR! This kind of failure is the responsibility of CI
administrators, and PR authors should complain about it, especially
when there are no signs that CI administrators are aware of and working
on addressing the problem.


*IF* all the conditions and assumptions contained in that final sentence are true, I would agree. Such a case points to incompetence or neglect on the part of the sysadmin who broke *the CI test* and then abandoned fixing it - complaints are reasonable there.

 [ Is kinkie acting incompetently on a regular basis? I think not. ]

Otherwise, a short period between the sysadmin thinking a change was safe and reverting it when breakage appears is to be expected. That is why we have the sysadmin posting advance notices, so we are all aware of planned CI changes. Complaints still happen, but that is not much reason to redesign the sysadmin practices and automation (which would be yet more CI change, ...).


A good example of a failure of the second kind is a -Wrange-loop-construct
error in a PR that does not touch any range loops (Jenkins conveniently
deleted the actual failed test, but my GitHub comment and PR contents
may be enough to restore what happened):
https://github.com/squid-cache/squid/pull/806#issuecomment-827924821


Thank you.

I see here two "rolling release" distros being updated by the sysadmin from producing outdated and wrong test results to producing correct test results. This is a correct change, in line with the goal of our nodes representing what a user running that OS would see when building Squid master or PRs. One distro changed its compiler, and both turned on a new warning by default, which exposed existing Squid bugs. Exactly as intended.
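
For illustration, here is a minimal sketch (hypothetical code, not taken from Squid or PR 806) of the kind of pre-existing copy that -Wrange-loop-construct reports. GCC 11 enables this warning as part of -Wall, so code that built cleanly before a compiler upgrade can start failing a -Werror build without the PR touching any range loops:

    #include <map>
    #include <string>

    // Hypothetical example of a long-standing bug that only becomes
    // visible once a node's compiler enables -Wrange-loop-construct.
    int countNonEmpty(const std::map<int, std::string> &table)
    {
        int n = 0;
        // The map's element type is std::pair<const int, std::string>, so
        // this loop variable binds to a temporary copy of every entry;
        // GCC 11 warns that the loop variable "binds to a temporary".
        for (const std::pair<int, std::string> &entry : table) {
            if (!entry.second.empty())
                ++n;
        }
        return n;
    }

Writing the loop variable as "const auto &entry" binds directly to the real element type, avoids the copy, and silences the warning.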

IMO we can expect this to occur on a regular basis, and it is specific to "rolling release" distros. We can resolve it by having those OSes build only in the N-matrix applied before releases, instead of in the matrix that blocks PR tests or merging.

 If we are all agreed, kinkie or I can implement this ASAP.


<skip>

B. PR submission testing
   - which OS for master (5-pr-test) ?
   - which OS for beta (5-pr-test) ?
   - which OS for stable (5-pr-test) ?

Are all of those sets the same identical OS+compilers? no.
Why are they forced to be the same matrix test?

I do not understand the question. Are you asking why Jenkins uses the
same 5-pr-test configuration for all three branches (master, beta, _and_
stable)? I do not know the answer.


So can we agree that they should be different tests?

 If we are all agreed, that can be implemented.

After test separation, we are free to choose which OSes to use, answering those questions I posed.

My idea is to go through distrowatch (see attached file) and sync the tests with the OSes that provide that vN (or lower) of Squid as part of their release. Of course, any additions wanted would follow the sysadmin testing process.



IIRC, policy forced on sysadmin with previous pain complaints.

Complaints, even legitimate ones, should not shape a policy. Goals and
principles should do that.

I remember one possibly related discussion where we were trying to
reduce Jenkins/PR wait times by changing which tests are run at what PR
merging stages, but that is probably a different issue because your
question appears to be about a single merging stage.


I think it was the discussion that re-invented the policy, prior to that performance one.


C. merge testing
   - which OS for master (5-pr-auto) ?
   - which OS for beta (5-pr-auto) ?
   - which OS for stable (5-pr-auto) ?
     NP: maintainer does manual override on beta/stable merges.

Are all of those sets the same identical OS+compilers? no.
  Why are they forced to be the same matrix test? Anubis

This is too cryptic for me to understand, but Anubis does not force any
tests on anybody -- it simply checks that the required tests have
passed. I am not aware of any Anubis bugs in this area, but please
correct me if I am wrong.


My understanding was that Anubis only has the ability to check PRs against its auto branch, which tracks master. The ability to have it track other, non-master branches and merge there is not available for use.

If that ability were available, we would need to implement a different matrix (as with N-pr-test) to use it without guaranteed pain points.

IMO we should look into this. But it is a technical project for sysadmin + Eduard to coordinate. Not a policy thing.



D. pre-release testing (snapshots + formal)
   - which OS for master (trunk-matrix) ?
   - which OS for beta (5-matrix) ?
   - which OS for stable (4-matrix) ?

Are all of those sets the same identical OS+compilers? no.
Are we forcing them to use the same matrix test? no.
Are we getting painful experiences from this? maybe.
  Most loud complaints have been about "breaking master" which is the
most volatile branch testing on the most volatile OS.

FWIW, I think you misunderstood what those "complaints" were about. I
do not know how that relates to the above questions/answers though.


Maybe. Our differing views on what comprises "breaking master" certainly confuse interpretations when the phrase is used as the problem/complaint/report description.



FTR: the reason all those matrices have a '5-' prefix is that, several
redesigns ago, the system was that master/trunk had a matrix to which the
sysadmin added nodes as OSes upgraded. When branching vN, the
maintainer would clone/freeze that matrix into an N-foo, which would be
used to test the code against the OS+compilers that the code in the vN
branch was designed to build on.

I think the above description implies that some time ago we were (more)
careful about (not) adding new nodes when testing stable branches. We
did not want a CI change to break a stable branch. That sounds like the
right principle to me (and it should apply to beta and master as well).
How that specific principle is accomplished is not important (to me) so
CI admins should propose whatever technique they think is best.


Can we have the people claiming pain specify exactly what the pain is
coming from, and let the sysadmin/developer(s) with specialized
knowledge of the automation in that area decide how best to fix it?

We can, and that is exactly what is going on in this thread AFAICT. This
particular thread was caused by CI changes breaking master, and
Francesco was discussing how to avoid such breakages in the future.

There are other goals/principles to observe, of course, and it is
possible that Francesco is proposing more changes to optimize something
else as well, but that is something only he can clarify (if needed).

AFAICT, Francesco and I are on the same page regarding not breaking
master anymore -- he graciously agreed to prevent such breakages in the
future, and I am very thankful that he did. Based on your comments
discussing several cases where such master breakage is, in your opinion,
OK, you currently disagree with that principle. I do not know why.


I think we differ in our definitions of "breaking master". You seem to be including breakage of things in the CI system itself, which I consider outside of "master", or expected results of normal sysadmin activity. I hope my responses to the two use cases you present at the top of this email clarify that.


Amos

Distrowatch report for Squid versions published:

Squid 5:

        Fedora (rawhide)
        Alpine Linux (3.13.5+)

Squid 4:

        Manjaro Linux
        Ubuntu
        Debian
        openSUSE
        Arch Linux
        Mageia
        FreeBSD
        PCLinuxOS
        CentOS
        Devuan GNU+Linux
        Gentoo Linux
        KNOPPIX
        Red Hat Enterprise Linux
        DragonFly BSD
        OpenBSD
        AlmaLinux OS
        Oracle Linux
        ALT Linux
        Clear Linux
        Calculate Linux
        Univention Corporate Server
        IPFire
        Debian Edu/Skolelinux
        Rocky Linux
        SUSE Linux Enterprise
        Zentyal Server
        NetBSD
        Karoshi
        Springdale Linux
        HardenedBSD
        Exherbo
        Vine Linux
        Untangle NG Firewall
        Network Security Toolkit
        Condres OS (not active)
        Feather Linux (not active)
        Frugalware Linux (not active)
        Lunar Linux (not active)

Squid 3.5:

        EuroLinux
        Funtoo Linux
        Scientific Linux
        SME Server
        Endian Firewall
        Asianux
        Rocks Cluster Distribution
        PLD Linux Distribution
        Devil-Linux (not active)
        Nova (not active)
        Windows Cygwin [Diladele]

Squid 3.4: (dead)

Squid 3.3: (dead)

Squid 3.2: (dead)

Squid 3.1:

        T2 SDE

Squid 3.0: (dead)

Squid 2.7:

        MidnightBSD
        Windows Native (Acme inactive)
