On 04/12/2020 at 03:21, Jean Helou wrote:
> Hello fellow jamers !
>
> The Jenkinsfile in the PR works, up until the test suite fails. The test
> failures come from seemingly "unstable" tests that fail because of timing
> issues. Benoit fixed the first one in
> https://github.com/apache/james-project/pull/267 by disabling read repairs
> during consistency checks (I have no idea what it means but it sounds
> awesome :) ). I fixed the second one in
> https://github.com/apache/james-project/pull/269 where the event bus sender
> and receivers were closed out of order on shutdown, sometimes leading to
> events being sent to a closed receiver.
>
> After some cleanup, Matthieu recreated a buildable PR which led to yet
> another unstable test in
> https://builds.apache.org/blue/organizations/jenkins/james%2FApacheJames/detail/PR-268/1/tests

We have been encountering this for a while. Thanks for digging in.

>
> I started investigating the issue and ended up roping in Matthieu since the
> symptoms left me completely puzzled. Matthieu managed to
> pinpoint the root cause to an NPE sometimes thrown from
> within org.apache.james.server.core.MimeMessageCopyOnWriteProxy which in
> turn triggered further NullPointerExceptions in the mailet pipeline error
> handling code.
> We finally confirmed a concurrency issue in the refcounting management of
> the proxy which, if I understand correctly, can lead to unrecoverable data
> loss. We wrote a test to trigger it [1] in an almost deterministic manner.

I'm in favor of opening a dedicated ticket and merging a disabled version
of this test in order to document the problem.

>
> Once we had a test to reproduce the race condition, we tried to fix the
> issue only to realize that it led to even more concurrency issues. The
> rather depressing conclusion we reached yesterday was that the whole
> implementation is currently unsound with regard to concurrency.
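
To make the refcounting discussion concrete, here is a minimal sketch of what a thread-safe reference count can look like. It is not the MimeMessageCopyOnWriteProxy code and makes no claim about how the real fix should be written: RefCountSketch, SharedResource, retain, release and hammer are hypothetical names used only for illustration, and the only thing assumed is the JDK's AtomicInteger.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical reference-counted holder; NOT the James implementation. */
public class RefCountSketch {

    /** Shared payload that is disposed when the last reference is dropped. */
    static final class SharedResource {
        // Starts at 1: the creator holds the first reference.
        private final AtomicInteger refCount = new AtomicInteger(1);
        private final AtomicInteger disposals = new AtomicInteger(0);

        /** Takes a new reference; fails if the resource was already disposed. */
        void retain() {
            int current;
            do {
                current = refCount.get();
                if (current == 0) {
                    throw new IllegalStateException("already disposed");
                }
            } while (!refCount.compareAndSet(current, current + 1));
        }

        /** Drops a reference; only the caller that reaches zero disposes. */
        void release() {
            int remaining = refCount.decrementAndGet();
            if (remaining == 0) {
                disposals.incrementAndGet();
            } else if (remaining < 0) {
                throw new IllegalStateException("released more often than retained");
            }
        }

        int references() { return refCount.get(); }
        int disposals() { return disposals.get(); }
    }

    /** Runs many concurrent retain/release pairs and returns the resource. */
    static SharedResource hammer(int threads, int iterations) throws InterruptedException {
        SharedResource resource = new SharedResource();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                try {
                    start.await(); // release all workers at the same time
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < iterations; i++) {
                    resource.retain();
                    resource.release();
                }
            });
        }
        start.countDown();
        pool.shutdown();
        if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
            throw new IllegalStateException("workers did not finish in time");
        }
        return resource;
    }

    public static void main(String[] args) throws InterruptedException {
        SharedResource resource = hammer(8, 10_000);
        System.out.println("references=" + resource.references()
            + " disposals=" + resource.disposals());
        resource.release(); // the creator drops the last reference
        System.out.println("references=" + resource.references()
            + " disposals=" + resource.disposals());
    }
}
```

Running main prints `references=1 disposals=0` and then `references=0 disposals=1`: the count is updated atomically and only the caller that observes zero disposes. A plain `int` counter with a separate check-then-act on it would allow a lost update or a double dispose under the same load, which is the general family of bug discussed here.
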
> I am unable to estimate the resolution effort at this point. Matthieu has
> some ideas and will work on it (as will I) when time allows.
>
> Which leads me to my current questions: I feel that fixing such long
> standing issues in the test suite is not actually part of configuring the
> apache CI but I am unsure how to proceed.

+1

>
> Here is what I would like to do at this stage :
> - Isolate the unstable tests under an unstable tag (akin to "feature
> tags")

I'd advocate a @Disabled tag, referencing both a JIRA ticket specific to
the bugfix needed, and the JIRA of the CI build.
Having a list of such issues in the JIRA (CI setup) ticket would be
valuable. I'd even advise creating subtickets to have a nice checklist.

> - exclude these tests from the default surefire execution profile,
> - add a parallel pipeline step for these tests where the step failure
> doesn't fail the pipeline [2]
> - ensure that the build is green
> - merge so the project finally has a working public CI
>
> I intend to start working on this quickly so we can all enjoy a functional
> public CI.

+1 I agree on the approach.

>
> Alternatives:
> - Merge the Jenkinsfile after the whole pipeline has been tested in the PR
> branch, which may not happen in the short to medium term...
> - Merging as is means that many builds on PRs will end up failing and the
> last steps (snapshot publish) might fail even if the test suite succeeds
> since it never ran.
> - Something I haven't thought of ?
>
> Another issue I want to raise is the availability of the CI builds. As you
> have seen from my experiments, the CI trigger configuration will only
> build commits from :
> - all branches of the main repository
> - all PRs opened from the main repository
> - all PRs opened by someone with write access to the main repository
>
> Which means that PRs from external contributors will not be built at all.
>
> I tried adding the issueCommentTrigger to the Jenkinsfile but neither my
> comments nor those of someone with commit access were able to trigger the
> build.
>
> I think that one of the project members should revise the current settings
> to make it possible to build external contributors' PRs one way or another
> (only project members have access or can have access to the jenkins project
> configuration).
> Here are two options:
> - the easiest and quickest modification is to let the CI build each and
> every PR; there are relatively few PRs on james so the burden on the CI
> platform shouldn't be too bad.
> - alternatively it may be possible to configure jenkins to require a
> comment from someone with write access to trigger a build. Unfortunately I
> am not certain how to set this up, maybe INFRA can help.

Having a build in the first place, even with the restrictions you describe,
sounds like good progress to me. I agree we need to see what other Apache
projects are doing, and if needed ask INFRA.

>
> I know this was a long piece, I look forward to reading your opinions !

Thanks for your involvement on this topic.

Benoit

> Jean
>
> [1] see
> https://github.com/jeantil/james-project/tree/james-3225-concurrency-bug-mimemessagecow
> [2] see
> https://stackoverflow.com/questions/44022775/jenkins-ignore-failure-in-pipeline-build-step
>
> On Thu, Nov 26, 2020 at 11:22 AM Jean Helou <jean.he...@gmail.com> wrote:
>
>> The good news is that docker does indeed work, the bad news is that the
>> tests fail with an issue that's too involved for me :/
>>
>> [INFO]
>> [INFO] Results:
>> [INFO]
>> [ERROR] Failures:
>> [ERROR]
>> CassandraMailboxManagerConsistencyTest$FailuresOnDeletion$DeleteOnce.deleteMailboxByPathShouldBeConsistentWhenMailboxPathDaoFails:433
>> Multiple Failures (1 failure)
>>
>> Expecting:
>> <[]>
>> to contain exactly (and in same order):
>> <[#private:user:INBOX]>
>> but could not find the following elements:
>> <[#private:user:INBOX]>
>>
>> at
>> CassandraMailboxManagerConsistencyTest$FailuresOnDeletion$DeleteOnce.lambda$deleteMailboxByPathShouldBeConsistentWhenMailboxPathDaoFails$8(CassandraMailboxManagerConsistencyTest$FailuresOnDeletion$DeleteOnce.java:440)
>>
>> so unless the build for
>>
>> * 6fab99364a - JAMES-3448 Rewrite links to http://james.apache.org/server/3/
>> (Mon Nov 23 15:10:36 2020 +0700) <Benoit Tellier> N
>>
>> is broken which sounds unlikely, I'm going to need help
>>
>> jean
>>
>> On Thu, Nov 26, 2020 at 10:53 AM Jean Helou <jean.he...@gmail.com> wrote:
>>
>>> on a loosely related note : the test suite logs are scary to look at:
>>> piles upon piles of stack traces and error logs but the tests actually pass
>>> ...
>>>
>>> On Thu, Nov 26, 2020 at 10:50 AM Jean Helou <jean.he...@gmail.com> wrote:
>>>
>>>> Thanks benoit,
>>>>
>>>> Matthieu pointed me to numerous apache projects with jenkinsfiles which
>>>> mention docker in
>>>> https://github.com/search?q=org%3Aapache++filename%3AJenkinsfile+docker&type=Code
>>>> so I'm trying out things based on that
>>>>
>>>> the logs seem promising so far :
>>>> ```
>>>>
>>>> [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed:
>>>> 0.697 s - in
>>>> org.apache.james.backends.rabbitmq.RabbitMQConnectionFactoryTest
>>>> ℹ︎ Checking the system...
>>>> ✔ Docker version should be at least 1.6.0
>>>> ✔ Docker environment should have more than 2GB free disk space
>>>> [INFO] Running org.apache.james.backends.rabbitmq.RabbitMQTest
>>>> ```
>>>>
>>>>
>>>> On Thu, Nov 26, 2020 at 10:40 AM Tellier Benoit <btell...@apache.org>
>>>> wrote:
>>>>
>>>>> Done
>>>>>
>>>>> On 26/11/2020 at 16:25, Jean Helou wrote:
>>>>>> hi all,
>>>>>>
>>>>>> As you know I started a PR to set up jenkins CI. The latest iteration
>>>>>> sees the compilation of the project complete in 5 minutes (thanks to
>>>>>> T1C) but the tests fail to initialize docker containers with the
>>>>>> disastrous consequences you can imagine :D
>>>>>>
>>>>>> I opened https://issues.apache.org/jira/browse/INFRA-21144 to ask if
>>>>>> it is possible to have the docker service enabled on some nodes. Since
>>>>>> I am not an official member of the project I think it may be useful if
>>>>>> you chimed in on the ticket to confirm that this is a legitimate request.
>>>>>>
>>>>>> Best regards,
>>>>>> Jean
>>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org
>>>>> For additional commands, e-mail: server-dev-h...@james.apache.org
>>>>>
>>>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org
For additional commands, e-mail: server-dev-h...@james.apache.org