Re: [Development] results of July flaky tests fixing

2022-09-01 Thread Thiago Macieira
On Thursday, 1 September 2022 06:28:55 -03 Volker Hilsheimer wrote:
> * stress tests for data races: if your test doesn’t expose any race
> conditions if you run with QThread::idealThreadCount threads, then it’s
> unlikely that it will expose races if you run with more threads. But with
> time sharing, the threads might run a lot longer than you expect. See e.g.
> https://codereview.qt-project.org/c/qt/qtbase/+/421391

Also, be careful with scaling. If your code scales quadratically, 8 cores isn't 
twice as bad as 4 cores relative to the 2-core baseline; it's 16x worse than 2 
cores ((8/2)^2 = 16, versus (4/2)^2 = 4). And if I run on 48 cores on a 
2-socket NUMA system, it's (48/2)^2 = 576x worse. Or worse.
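
For illustration, a minimal sketch (not from this thread) of a stress test 
whose total work grows with the square of the thread count - every worker 
increments every worker's counter - so a loop count tuned on a 4-core laptop 
can blow up on a 48-core CI machine:

    // Minimal sketch: total increments = threads * rounds * threads, i.e.
    // O(n^2) in the thread count, and the shared counters serialize the
    // threads on cache-line traffic, so there is little parallel speedup.
    #include <QAtomicInt>
    #include <QThread>
    #include <QVector>

    int main()
    {
        const int n = QThread::idealThreadCount();
        QVector<QAtomicInt> counters(n);
        QVector<QThread *> workers;
        workers.reserve(n);
        for (int i = 0; i < n; ++i) {
            workers.append(QThread::create([&counters, n] {
                for (int round = 0; round < 10000; ++round)
                    for (int j = 0; j < n; ++j)            // every thread ...
                        counters[j].fetchAndAddRelaxed(1); // ... hits every counter
            }));
            workers.last()->start();
        }
        for (QThread *t : workers) {
            t->wait();
            delete t;
        }
        return 0;
    }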

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering



Re: [Development] results of July flaky tests fixing

2022-09-01 Thread Edward Welbourne
Volker Hilsheimer (1 September 2022 11:28) wrote (inter alia):
> * hardcoded waiting times are an anti-pattern.

A good way to avoid them is to use the QTRY_*() family of macros, as
long as you can find something that shall become true by the time the
waiting is no longer needed.
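
As a minimal sketch of the pattern (the tst_Example class and the scenario are 
made up; only the macros are real QtTest API):

    #include <QtTest/QtTest>
    #include <QSignalSpy>
    #include <QPushButton>

    class tst_Example : public QObject
    {
        Q_OBJECT
    private slots:
        void clickEmitsClicked()
        {
            QPushButton button;
            QSignalSpy spy(&button, &QAbstractButton::clicked);
            button.show();
            QVERIFY(QTest::qWaitForWindowExposed(&button));

            QTest::mouseClick(&button, Qt::LeftButton);

            // Anti-pattern: QTest::qWait(500); QCOMPARE(spy.count(), 1);
            // Better: keep retrying until the condition holds, or fail when
            // the timeout (5 s by default) expires.
            QTRY_COMPARE(spy.count(), 1);
        }
    };

    QTEST_MAIN(tst_Example)
    #include "tst_example.moc"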

Eddy.


Re: [Development] results of July flaky tests fixing

2022-09-01 Thread Volker Hilsheimer
Thanks for sharing that overview, Anna!


While chasing failures of my integrations caused by flaky tests over the 
summer, I saw a couple of patterns I think are worth keeping in mind when 
investigating flaky tests, or when writing new ones:

* QTest::qWaitForWindowActive - very often, a test doesn’t need an active 
window at all, but just an exposed window. Use QTest::qWaitForWindowExposed 
instead (see the sketch after this list).

* stress tests for data races: if your test doesn’t expose any race conditions 
if you run with QThread::idealThreadCount threads, then it’s unlikely that it 
will expose races if you run with more threads. But with time sharing, the 
threads might run a lot longer than you expect. See e.g. 
https://codereview.qt-project.org/c/qt/qtbase/+/421391

* hardcoded waiting times are an anti-pattern. I know it’s not always possible 
to avoid (we don’t have qWaitFor… helpers for everything), but when testing 
high-level functionality that relies on lower-level functionality, then it’s a 
good idea to check that the lower-level bits worked. E.g. 
https://codereview.qt-project.org/c/qt/qtbase/+/421658
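
Regarding the first point, a minimal hypothetical sketch (the test class and 
check are made up): activation needs the window manager to hand focus to the 
test window, which some CI platforms do slowly or not at all, while exposure 
only needs the window to be mapped and painted:

    #include <QtTest/QtTest>
    #include <QWidget>

    class tst_Expose : public QObject
    {
        Q_OBJECT
    private slots:
        void paintsWithoutFocus()
        {
            QWidget w;
            w.show();
            // Only needed if the test really depends on focus, and flaky on
            // platforms that are slow to activate windows:
            //     QVERIFY(QTest::qWaitForWindowActive(&w));
            // Usually sufficient - the window is mapped and has been painted:
            QVERIFY(QTest::qWaitForWindowExposed(&w));
            QVERIFY(w.isVisible());
        }
    };

    QTEST_MAIN(tst_Expose)
    #include "tst_expose.moc"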


To the last point - tests can use our private APIs, so adding private 
infrastructure that makes it easier to write robust tests is a good idea!
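
As a sketch of the mechanics only (QObjectPrivate is real Qt private API, but 
the check below is just a placeholder for whatever internal state a real test 
would want to assert): link the private module (qmake: QT += core-private; 
CMake: Qt6::CorePrivate) and include the private header, accepting that 
private headers come with no compatibility guarantees:

    #include <QtCore/private/qobject_p.h>
    #include <QtTest/QtTest>

    class tst_PrivatePeek : public QObject
    {
        Q_OBJECT
    private slots:
        void dPointerIsAccessible()
        {
            QObject receiver;
            // QObjectPrivate::get() hands out the d-pointer; a robust test can
            // assert on internal state directly instead of sleeping and hoping.
            QVERIFY(QObjectPrivate::get(&receiver) != nullptr);
        }
    };

    QTEST_MAIN(tst_PrivatePeek)
    #include "tst_privatepeek.moc"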


Cheers,
Volker




[Development] results of July flaky tests fixing

2022-08-30 Thread Anna Wojciechowska
Hello,


I would like to present the results of the July fixing of flaky tests.


short version:

19 - the number of platform-specific flakiness cases to start with (from 14 different tests)


as a result:

11 cases of platform-specific flakiness were fixed (caused by 8 different tests)

4 cases of platform-specific flakiness remained flaky (from 4 different tests)

4 cases of flakiness were blacklisted (2 different tests)


The table under the link below shows more detailed information about fixed 
tests.

https://wiki.qt.io/Fixed_flaky_tests_in_July_2022


long version:

How was the problem approached?

We collected data about flakiness from June; in July we created a list of the 
top "worst" cases that failed integrations and contacted module maintainers. We 
gave the changes some time to be merged and run a sufficient number of times to 
gain confidence that the fixes actually worked, and in late August we checked 
the results again.


The complete list of flaky tests from June that were being fixed in July can 
be found at this link:

https://testresults.qt.io/grafana/d/7/flaky-summary-ci-test-info?orgId=1=165662640=1659304799000=65


Which tests were included in the analysis?

Tests from the dev branch that negatively impacted the integration system by 
causing at least one failure in any integration and at least one flaky event.


What is the difference between a failed and a flaky test?

You can find a good explanation here:

https://testresults.qt.io/grafana/d/7/flaky-summary-ci-test-info?orgId=1=55

and here:
https://testresults.qt.io/grafana/d/7/flaky-summary-ci-test-info?orgId=1=41


What is understood by a "test"?

A test is an umbrella term for a pair: test case and test function. A test case 
(usually a cpp file) contains several test functions that return results (pass, 
fail, or xfail); we collected and analyzed those results. Additionally, some 
tests contain data tags - test function arguments that provide even more 
detailed results - but we do not store them, so the granularity of the data 
ends at the test function level.
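
As a concrete (made-up) illustration of the three levels - test case, test 
function, data tag - in the style of the Qt Test tutorial:

    #include <QtTest/QtTest>

    // Test case: usually one QObject subclass per cpp file.
    class tst_StringOps : public QObject
    {
        Q_OBJECT
    private slots:
        // Data tags for the test function below; each row is one tag.
        void toUpper_data()
        {
            QTest::addColumn<QString>("input");
            QTest::addColumn<QString>("expected");
            QTest::newRow("all lower") << "hello" << "HELLO";
            QTest::newRow("mixed")     << "Hello" << "HELLO";
        }

        // Test function: runs once per data tag and reports pass/fail/xfail,
        // but the CI statistics described above stop at this level.
        void toUpper()
        {
            QFETCH(QString, input);
            QFETCH(QString, expected);
            QCOMPARE(input.toUpper(), expected);
        }
    };

    QTEST_MAIN(tst_StringOps)
    #include "tst_stringops.moc"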


What is understood by a "platform-specific flakiness"?

A test runs on a specific platform - we describe it by "target operating 
system" and "target architecture". In most cases, flakiness is related to a 
particular test run on a specific platform.

E.g., test case "tst_qmutex", test function "more stress" can return stable 
results on most platforms but be flaky on MacOS_11 X86_64 or on Windows_10_21H2 
X86_64. In such a case, it is counted as 2 cases of "platform-specific 
flakiness" (MacOS_11 and Windows_10_21H2) caused by a single (unique, distinct) 
test.


Since the July fixing provided good results, we repeated the procedure in 
August: we gathered data about the most damaging flaky tests (those failing 
integrations) and compared it to July's, to make sure only "new" tests are on 
the list. August's failing flakiness can be viewed at the link below. 
Developers and maintainers are welcome to check whether their tests are on the 
list.

https://wiki.qt.io/Flaky_tests_that_caused_failures_in_August


Big thanks to everyone participating in fixing the tests!


Anna Wojciechowska


The notebooks used to prepare this analysis can be found at:

https://git.qt.io/qtqa/notebooks/-/tree/main/flakiness/august_2022

___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development