Hi All,
If I am parsing this thread correctly, it seems we have a number of
issues to attack, and some are already progressing: tmp misconfig,
docker misconfig, mismatched resources across CI envs, no
definition of minimal HW requirements, etc.
But so far nothing against merging
> However, the docker space
> issue needs to be resolved first since we don't have the capacity to
> experiment with those nodes out of commission.
>
ETA on fixing the docker space issues is this/next week. Once that lands we
can take a look at the abnormal CPU usage on some nodes.
> Having parity between CI systems is important, no matter how we approach it.
How much does the hardware allocation (CPU, memory, disk throughput, network
throughput) differ between ASF Jenkins and CircleCI midres? How much does the
container isolation differ?
i.e. why are we seeing bugged tests
>
> What I mean by that specifically: if you under-provision a node with 2
> cpus, 1.5 gigs of ram, slow disks, slow networking, and noisy neighbors,
> and the nodes take so long with GC pauses, compaction, streaming, etc that
> they don't correctly complete certain operations in expected time,
>
Just wanted to bring up that we started seeing a trend pre-4.0, and it keeps
showing up now on the way to 4.1: legit bugs are found in CircleCI that do
not pop up at all in Jenkins. So my appeal is to keep checking CircleCI
thoroughly as well, even if some failures are not visible
Bringing discussion from JIRA (CASSANDRA-17729) to here:
Mick said:
> Agree with the notion that Jenkins (lower resources/more contention) is
> better at exposing flakies, but that there's a trade-off between encouraging
> flakies and creating difficult-to-deal-with noise.
I come back to the
I suspect there's another problem with some of the Jenkins nodes where
the system CPU usage is high and drives the load much higher than
other nodes, possibly causing timeouts. However, the docker space
issue needs to be resolved first since we don't have the capacity to
experiment with those
Another option would be to increase the resources dedicated to each agent
container and run less in parallel. Or, best yet, do both (up timeouts and
lower parallelization / up resources).
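
To make the parallelism vs. resources trade-off concrete, here is a rough
sketch in Python. The node size (16 CPUs, 64 GB) and the larger per-agent
allocation are made-up numbers for illustration, not the actual ASF Jenkins
configuration; only the "2 cpus, 1.5 gigs" figure comes from the example
earlier in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A CI host; the sizes used below are hypothetical."""
    cpus: int
    memory_gb: float


def max_parallel_agents(node: Node, cpus_per_agent: int, memory_gb_per_agent: float) -> int:
    """Number of agent containers that fit on a node without oversubscription."""
    by_cpu = node.cpus // cpus_per_agent
    by_memory = int(node.memory_gb // memory_gb_per_agent)
    # The scarcer resource decides how many containers can run side by side.
    return min(by_cpu, by_memory)


# Hypothetical node; real agent hardware may differ.
node = Node(cpus=16, memory_gb=64)

# Small containers (2 CPUs, 1.5 GB each): more run in parallel.
small = max_parallel_agents(node, cpus_per_agent=2, memory_gb_per_agent=1.5)

# Larger containers (assumed 4 CPUs, 8 GB each): fewer run in parallel.
large = max_parallel_agents(node, cpus_per_agent=4, memory_gb_per_agent=8)

print(f"small containers: {small} in parallel, large containers: {large} in parallel")
```

With these assumed numbers, doubling the CPUs per container halves how many
run in parallel on the node, which is the cost side of the "up resources"
option.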
As far as I can tell, the failures on Jenkins aren't value-add compared to what
we're seeing on CircleCI