Hi guys,

It's no big secret that working on races, especially in the OpenStack gates, is quite a complicated task.

Current workflow:

1) Some tempest test run fails
2) Build rules for elastic-recheck & report a bug
3) Recheck with the specific bug XXX
4) Collect stats
5) Attempt to fix the bug, believe that the change fixes it, and MERGE IT!
6) Monitor stats: http://status.openstack.org/elastic-recheck/
7) Repeat 5-6 until the bug is fixed

With Rally we can improve this workflow.

As you probably know, many projects run a rally job. Usually it's called "gate-rally-dsvm-<something>".

This job does a simple thing:

1) Runs a dsvm job that installs OpenStack + Rally
2) Runs a Rally task (a set of benchmarks) against this cloud
3) Creates a pretty page with results:
   http://logs.openstack.org/71/137671/1/check/gate-rally-dsvm-rally/6dc39b6/
4) Puts a +1/-1 vote depending on the success criteria (SLA) of the benchmarks specified in the task

This job is very precise and flexible, as opposed to the tempest job, which just runs the set of functional tests predefined in tempest and infra.

1) You have a plugins dir where you can put plugins:
   https://github.com/openstack/cinder/tree/master/rally-jobs/plugins
   In Rally almost everything is pluggable: success criteria, load generators, benchmark scenarios, contexts, ...
2) You have a task file with the specification of which benchmarks to run:
   https://github.com/openstack/cinder/blob/master/rally-jobs/cinder.yaml
   That allows you to specify which benchmarks run in the gates.

New workflow for fixing races with Rally:

1) Create a benchmark (or use an existing one) that exercises the racy code and reproduces the race with close to 100% likelihood.
2) Push the patch to review and ensure that the rally job fails.
3) Push the fix, plus a dependent patch with the rally task file changes that reproduce the bug.
4) If the bug is not reproduced, merge the first patch (the fix) and abandon the change with the rally task changes.
5) PROFIT!
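
To make the task file and the SLA-based vote more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the scenario name, its arguments, and the runner numbers are placeholders and are not copied from any project's rally-jobs file, and the vote() helper is a hypothetical toy, not Rally's or the gate's actual code. The entry is printed as JSON only to show its shape; the task files linked above are YAML.

```python
import json

# Hypothetical task entry. The scenario name, args and numbers are
# illustrative assumptions, not taken from a real rally-jobs task file.
task = {
    "CinderVolumes.create_and_attach_volume": [
        {
            # Arguments passed to the benchmark scenario (placeholders).
            "args": {
                "size": 1,
                "image": {"name": "^cirros.*"},
                "flavor": {"name": "m1.tiny"},
            },
            # Load generator: run the scenario 10 times, 2 at a time.
            "runner": {"type": "constant", "times": 10, "concurrency": 2},
            # Context: temporary tenants and users created for the run.
            "context": {"users": {"tenants": 1, "users_per_tenant": 1}},
            # Success criterion: no failed iterations allowed (percent).
            "sla": {"failure_rate": {"max": 0}},
        }
    ]
}


def vote(failed, total, max_failure_rate):
    """Toy vote: +1 if the failure rate (in percent) is within the limit, else -1."""
    if total <= 0:
        raise ValueError("total must be a positive number of iterations")
    failure_rate = 100.0 * failed / total
    return 1 if failure_rate <= max_failure_rate else -1


if __name__ == "__main__":
    print(json.dumps(task, indent=2))
    entry = task["CinderVolumes.create_and_attach_volume"][0]
    limit = entry["sla"]["failure_rate"]["max"]
    times = entry["runner"]["times"]
    print("no failures ->", vote(0, times, limit))
    print("3 failures  ->", vote(3, times, limit))
```

With a zero-tolerance limit like this, a single failed iteration is enough to turn the vote negative, which is what makes such a job usable as a check for a race fix.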

As a demo I changed the rally task so that it reproduces a high-priority cinder bug (volumes are not attached):
https://bugs.launchpad.net/nova/+bug/1240728

So here is the patch:

1) https://review.openstack.org/#/c/137885/

2) In the rally task we specify that the benchmark runs 11 times, with 4 scenarios running simultaneously. Each one does: create server, create volume, attach volume to server, detach volume, delete server.

3) We can see that the Rally job returns -1. After that we can click on its URL and see this page:
   http://logs.openstack.org/85/137885/2/check/gate-rally-dsvm-cinder/7790157/

4) There are 2 interesting links on it:

   A) HTML report, which shows which benchmark actually failed:
      http://logs.openstack.org/85/137885/2/check/gate-rally-dsvm-cinder/7790157/rally-plot/results.html.gz

   B) DSVM logs (logs of all services):
      http://logs.openstack.org/85/137885/2/check/gate-rally-dsvm-cinder/7790157/logs/
      Here you can find the cinder logs and the actual exception that occurs:
      screen-c-vol.txt.gz -> http://logs.openstack.org/85/137885/2/check/gate-rally-dsvm-cinder/7790157/logs/screen-c-vol.txt.gz?level=ERROR

So now we can reproduce the race condition in the gates with close to 100% likelihood; in other words, we are able to test that a fix really fixes this issue.

Happy bug fixing! =)

Best regards,
Boris Pavlovic

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev