Hi,

The stable releases (hammer, infernalis) did not make progress in the past few 
weeks because we have been unable to run tests.

Before xmas the following happened:

* the sepia lab was migrated and we discovered the OpenStack teuthology backend 
can't run without it (that was a problem for only a few days)
* there are OpenStack-specific failures in each teuthology suite and it is 
non-trivial to separate them from genuine backport errors
* the make check bot went down (it was partially running on my private hardware)

If we just wait, I'm not sure when we will be able to resume our work because:

* the sepia lab is back but has less horsepower than it used to
* not all of us have access to the sepia lab
* the make check bot is being worked on by the infrastructure team, but it is 
low priority and it may take weeks before it's back online
* the ceph-qa-suite errors that are OpenStack specific are low priority and 
may never be fixed

I think we should rely on the sepia lab for testing for the foreseeable future 
and wait for the make check bot to come back. Tests will take a long time to 
run, but we've been able to work with a one-week delay before, so it's not a 
blocker.

Although fixing the OpenStack-specific errors would allow us to use the 
teuthology OpenStack backend (I will fix the last error left in the rados 
suite), it is unrealistic to set that as a requirement to run tests: we have 
neither the workforce nor the skills to do it. Hopefully, some time in the 
future, Ceph developers will use ceph-qa-suite on OpenStack as part of the 
development workflow. But right now, running ceph-qa-suite on OpenStack is 
outside of the development workflow and in a state of continuous regression, 
which is inconvenient for us because we need a stable baseline against which 
to compare runs from the integration branch.
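
To make the comparison concrete, here is a minimal sketch of how failures 
unique to the integration branch could be separated from known environmental 
failures, assuming the failed job descriptions of each run have been dumped 
one per line into a text file (the file names and the format are made up for 
illustration, this is not an existing tool):

    # compare_runs.py: hypothetical sketch, not an existing tool.
    import sys

    def failed_jobs(path):
        """Return the set of failed job descriptions listed in path."""
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    baseline = failed_jobs(sys.argv[1])   # e.g. rados suite, stable baseline
    candidate = failed_jobs(sys.argv[2])  # same suite, integration branch

    # Failures seen in both runs are environmental (e.g. OpenStack specific)
    # and can be set aside; failures unique to the integration branch are
    # the ones that may be genuine backport errors and need a human look.
    for job in sorted(candidate - baseline):
        print("needs triage:", job)
    for job in sorted(candidate & baseline):
        print("known environmental failure:", job)

For example: python compare_runs.py baseline-failures.txt 
integration-failures.txt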

Fixing the make check bot is a two-part problem. Each failed run must be 
looked at to chase false negatives (continuous integration with false 
negatives is a plague), which I have done daily over the past year and am 
happy to keep doing. Before the xmas break, the bot running at 
jenkins.ceph.com reported over 90% false negatives, primarily because it was 
trying to run on unsupported operating systems; it was stopped until this is 
fixed. It also appears that the machine running the bot is not re-imaged 
after each test, meaning a buggy run may taint all future tests and create a 
continuous flow of false negatives. Addressing these two issues requires 
knowing or learning about the Ceph Jenkins setup and slave provisioning. That 
is probably a few days of work, which is why the infrastructure team can't 
resolve it immediately.
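
To illustrate the re-imaging point, here is a minimal sketch (in Python, with 
a made-up image name and simplified build commands) of per-run isolation: 
running each make check in a throwaway container so that nothing survives 
from one run to the next. This is not how jenkins.ceph.com is actually set 
up, just the general idea:

    # clean_run.py: hypothetical sketch of per-run isolation.
    import subprocess
    import sys

    IMAGE = "ceph-build-env:latest"  # made-up name for a pristine build image

    def run_make_check(repo_url, branch):
        """Clone the branch and run make check inside a fresh container.

        --rm discards the container afterwards, so every run starts from
        the same pristine image and a buggy run cannot taint the next one.
        """
        script = ("git clone --depth 1 --branch {b} {r} /ceph && "
                  "cd /ceph && ./autogen.sh && ./configure && "
                  "make check").format(b=branch, r=repo_url)
        result = subprocess.run(["docker", "run", "--rm", IMAGE,
                                 "bash", "-c", script])
        return result.returncode == 0

    if __name__ == "__main__":
        ok = run_make_check(sys.argv[1], sys.argv[2])
        sys.exit(0 if ok else 1)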

If you have alternative creative ideas on how to improve the current situation, 
please speak up :-)

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre
