Hi.
I've been discussing this with Cleber and Lucas (in private just
because I'm their manager at Red Hat) and decided to open this to
the general audience of autotest, in the hope that we'll get more
ideas and refine the brainstorm:
We have an internal testgrid that runs some of the virt-tests,
but for each test that passes, we've been getting at least two
notifications of failures. The vast majority of them are due
to infrastructure problems (network outages, a repository being
offline, a full disk, NFS failures, Cobbler failures, aborted
jobs, etc).
This is a historical problem in autotest, and we want a clean
solution that fixes it for good, without kludges or ugly hacks.
So I propose we think outside of the box: what would be the ideal
solution to this problem, without the limitations imposed by the
current autotest architecture or backwards compatibility?
Once we have this ideal solution as a goal, we start thinking of
what needs to be sacrificed because of the autotest architecture,
not the other way around.
Naturally, the solution can be implemented in phases.
Here is my proposal, at the requirements level:
Definitions:
- Testgrid: the infrastructure used to run tests. It's composed
of test runner(s), scheduler(s), RPC server(s), database(s),
infrastructure for provisioning, etc.
- Autotest user: submits jobs to the testgrid and/or monitors the
output of the jobs run;
- Testgrid admin: responsible for the maintenance of the
testgrid, fixing the infrastructure and the services that it
depends on;
Requirements (as user stories):
- As an autotest user, I want to be able to declare
requirements for my test to be run. For example, I may need a
specific package installed, or a specific service to be
online. Besides, the test runner should automatically find
out some requirements based on the test code I write. For
example: if I use a method exposed by autotest that has a
dependency on a particular service or package, the test
runner should automatically consider it a requirement of my
test as well.
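To make the idea concrete, here's a minimal sketch of what
declared plus inferred requirements could look like. The
@requires decorator, the REQUIREMENTS registry and the
HELPER_DEPS table are all hypothetical names, not part of any
existing autotest API:

```python
# Registry of explicitly declared requirements (illustrative).
REQUIREMENTS = {}

def requires(*deps):
    """Decorator: record packages/services a test declares it needs."""
    def wrapper(func):
        REQUIREMENTS[func.__name__] = list(deps)
        return func
    return wrapper

@requires("package:qemu-kvm", "service:libvirtd")
def test_vm_boot():
    pass

# Map of helper methods to the dependencies they imply; the test
# runner could build this from the framework's own code.
HELPER_DEPS = {"copy_from_nfs": ["service:nfs"]}

def inferred_requirements(test_name, helpers_used):
    """Combine declared requirements with ones implied by helper usage."""
    deps = list(REQUIREMENTS.get(test_name, []))
    for helper in helpers_used:
        for dep in HELPER_DEPS.get(helper, []):
            if dep not in deps:
                deps.append(dep)
    return deps
```

With that in place, a test calling copy_from_nfs() would
automatically pick up "service:nfs" as a requirement without the
test author having to declare it.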
- As an autotest user, I want two primary kinds of
notifications sent to me over e-mail: either the test ran and
passed, or the test ran and failed (note: in both cases the
test did run). Receiving a notification of a test failure
should be like an alarm: it means there's something broken
with *my code* that needs immediate attention. False
positives should be a very
rare exception. Test jobs that failed due to broken
infrastructure or broken services should be kept in a queue
for a (long) period of time until the infrastructure gets
fixed. After that period, they should be aborted,
potentially sending me a notification e-mail.
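The hold-then-abort behavior could be sketched roughly like
this. The error classification, the week-long timeout and the
queue itself are assumptions for illustration only:

```python
import time
from collections import deque

# Failure kinds treated as infrastructure problems (illustrative list).
INFRA_ERRORS = ("NFSError", "CobblerError", "NetworkError", "DiskFullError")
HOLD_TIMEOUT = 7 * 24 * 3600  # hold for a week before aborting (assumed)

held_jobs = deque()  # (job, timestamp) pairs waiting for infra to be fixed

def handle_failure(job, error_kind, now=None):
    """Notify only on real test failures; hold infra failures for retry."""
    now = time.time() if now is None else now
    if error_kind in INFRA_ERRORS:
        held_jobs.append((job, now))
        return "held"
    return "notify-failure"

def expire_held_jobs(now=None):
    """Abort (and return) jobs that were held longer than HOLD_TIMEOUT."""
    now = time.time() if now is None else now
    aborted = []
    while held_jobs and now - held_jobs[0][1] > HOLD_TIMEOUT:
        job, _ = held_jobs.popleft()
        aborted.append(job)  # caller could send the abort notification here
    return aborted
```

The key property is that a user never gets a "failure" e-mail
for an NFS outage; at worst they get an "aborted after waiting"
e-mail a week later.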
- As an autotest user, I want the e-mail that notifies me of
the job status to be consistent and clear about what went
wrong. It should include links to more detailed information,
log snippets, version of the components run, failure rate,
etc. I don't want e-mails with missing fields or inconsistent
reports.
- As an autotest user, I should be able to query the testgrid
queue, my job status and the testgrid status via some sort of
webservice API. A dashboard and rpc-client using this
API would be great.
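For the sake of discussion, the webservice side could be as
simple as JSON method calls dispatched to a small API object.
The method names, the request format and the status values here
are invented for illustration:

```python
import json

class TestgridAPI:
    """Hypothetical query API a dashboard or rpc-client could consume."""

    def __init__(self, jobs):
        self._jobs = jobs  # job_id -> status ("queued", "running", ...)

    def get_job_status(self, job_id):
        return {"job_id": job_id,
                "status": self._jobs.get(job_id, "unknown")}

    def get_queue(self):
        return [j for j, s in sorted(self._jobs.items()) if s == "queued"]

def handle_request(api, payload):
    """Tiny JSON dispatcher standing in for the real RPC layer."""
    req = json.loads(payload)
    method = getattr(api, req["method"])
    return json.dumps(method(*req.get("params", [])))
```

A dashboard would then just poll get_queue() and
get_job_status() over HTTP; the transport is orthogonal to the
API shape.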
- As a testgrid admin, I want to be notified if a service is
broken or offline. I want to have scripts or tests that
monitor these services and pause the testgrid if something is
wrong, putting the test jobs on hold.
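A monitor like that could be little more than a set of named
health checks and a pause flag, something like the sketch below
(check functions and the Testgrid class are illustrative):

```python
def check_services(checks):
    """Run each named health check; return the list of broken services."""
    return [name for name, check in checks.items() if not check()]

class Testgrid:
    """Minimal stand-in for the grid's scheduling state."""

    def __init__(self):
        self.paused = False

    def monitor(self, checks):
        """Pause the grid while any monitored service is down."""
        broken = check_services(checks)
        self.paused = bool(broken)  # jobs stay on hold while paused
        return broken
```

Each check would in practice ping NFS, Cobbler, the database,
etc; the point is that a failing check pauses scheduling instead
of letting jobs run and fail.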
- As a testgrid admin, I want to be able to tell how long the
testgrid was offline due to broken infrastructure and list which
services went down and when, to have a general idea of what's
broken and needs to be fixed in the long term.
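If the monitor above logged down/up events, the accounting could
be a simple aggregation over that log. The (service, state,
timestamp) event format is an assumption:

```python
def summarize_outages(events):
    """Return {service: total_seconds_down} from (service, state, ts) events.

    'state' is "down" or "up"; unmatched "down" events (still
    ongoing outages) are ignored here for simplicity.
    """
    down_since = {}
    totals = {}
    for service, state, ts in sorted(events, key=lambda e: e[2]):
        if state == "down":
            down_since.setdefault(service, ts)
        elif state == "up" and service in down_since:
            start = down_since.pop(service)
            totals[service] = totals.get(service, 0) + ts - start
    return totals
```

Sorting those totals gives exactly the "what's broken most
often" view an admin needs for long-term fixes.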
- As a testgrid admin, I want to be able to select which tests
should run on which platform/hardware/OS. For example, I want
to blacklist some tests (or variants) from a particular
machine in the virtlab, or from a particular version of the
OS.
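One possible shape for such a blacklist is a list of rules where
each rule matches on test name, host and/or OS version; the rule
format and field names below are made up for illustration:

```python
import re

# Hypothetical blacklist rules: a job is skipped if every pattern
# in some rule matches the corresponding attribute.
BLACKLIST = [
    {"test": "block_stream", "host": "virtlab-03"},
    {"test": "migrate.*", "os": "RHEL-5"},
]

def is_blacklisted(test, host, os_version, rules=BLACKLIST):
    """True if any rule matches this (test, host, os) combination."""
    ctx = {"test": test, "host": host, "os": os_version}
    for rule in rules:
        if all(re.fullmatch(pat, ctx[key]) for key, pat in rule.items()):
            return True
    return False
```

Rules that omit a field (like "host" in the second rule) simply
don't constrain it, which keeps per-machine and per-OS filtering
in the same mechanism.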
Comments?
Thanks.
- Ademar
--
Ademar de Souza Reis Jr.
Red Hat
_______________________________________________
Virt-test-devel mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/virt-test-devel