Re: [openstack-dev] [kolla] PTG Summary

2018-03-08 Thread Adam Spiers

Paul Bourke wrote:

Hi all,

Here's my summary of the various topics we discussed during the PTG. 
There were one or two I had to step out for but hopefully this serves 
as an overall recap. Please refer to the main etherpad[0] for more 
details and links to the session specific pads.


[snipped]


self health check support
=========================
* This had some crossover with the monitoring discussion.
* Kolla has some checks in the form of our 'sanity checks', but these 
are underutilised and not implemented for every service. Tempest or 
rally would be a better fit here.


Actions:
* Remove the sanity check code from kolla-ansible - it's not fit for 
purpose and our assumption is no one is using it.
* Make contact with the self healing SIG, and see if we can help here. 
They may have recommendations for us.

* Make a spec for this.


[snipped]

Would be great to collaborate!  As the SIG is still new we don't have
regular meetings set up yet, but please join #openstack-self-healing
on IRC, and you can mail the openstack-sigs list with [self-healing]
in the subject.


Implement rolling upgrade for all core projects
===============================================
* Started by defining the 'terms of engagement', i.e. what do we mean 
by rolling upgrade in kolla, what we currently have vs. what projects 
support, etc.
* There are two efforts under way here: 1) supporting online upgrade 
for all core projects that support it, 2) supporting FFU (offline) 
upgrade in Kolla.

* lujinluo is working on a way to do online FFU in Kolla.
* Testing - we need gates to test upgrade.

Actions:
* Finish implementation of rolling upgrade for all projects that 
support it in Rocky

* Improve documentation around this and upgrades in general for Kolla
* Spec in Rocky for FFU and associated efforts
* Begin looking at what would be required for upgrade gates in Kolla


Yes, a spec or other docs nailing down exactly what is meant by
rolling upgrade and FFU upgrade would be a great help.  I was in the
FFU session in Dublin and it felt to me like not everyone was on the
same page yet regarding the precise definitions, making it difficult
for all projects to move forward together in a coherent fashion.



[openstack-dev] [kolla] PTG Summary

2018-03-08 Thread Paul Bourke

Hi all,

Here's my summary of the various topics we discussed during the PTG. 
There were one or two I had to step out for but hopefully this serves as 
an overall recap. Please refer to the main etherpad[0] for more details 
and links to the session specific pads.


build.py script refactor
========================
* I think there was little debate that we need this. However, discussion 
moved fairly quickly towards whether there are changes we can make to our 
images that would not require maintaining such a large build script in the 
first place.
* loci images are making good progress and are already in use by 
openstack-helm
  * By moving the start scripts from the kolla images into 
kolla-ansible we can decouple ourselves from these images and open the 
possibility of consuming images from other sources such as loci.


Actions:
* Do a poc of externalising start scripts (started under 
https://review.openstack.org/#/c/550500/)


plugin split from main images
=============================
* Plugins continue to be a contentious issue in Kolla
* The current approach of installing all available plugins 'out of the 
box' is not working for certain users.
* Sam Betts had a good example of why this is not working for them, but I 
don't feel I can summarise it properly. Will reach out to him to clarify.
* We didn't reach a conclusion on this, as there seem to be pros and cons 
to each approach. Needs further discussion and possibly some pocs.


ansible "--check" and "--diff" mode
===================================
* Operators would like to see some dry run like features in kolla-ansible.
* Would like to see the return of something like genconfig, where 
configs can be generated ahead of time and diffed/reviewed before deploy.
* Also some general discussion in this session on management and scaling 
difficulties with kolla.
  * Inventory management needs to be more flexible.
  * Operations are too slow once you hit about 200 nodes; operators are 
finding they have to use manual trickery to divide up their inventories.
  * A lot of operations take place when very little has changed config wise.

Actions:
* No specific actions came out of this at this time. I think we'd need 
more time on this topic to determine specific work items that can make 
improvements here.
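
As a rough illustration of the "generate configs ahead of time and diff 
them before deploy" idea above, here is a minimal sketch using only the 
Python standard library. It is not kolla-ansible code and was not proposed 
at the PTG; the function name diff_configs and the sample nova.conf 
contents are hypothetical, and a real workflow would read the deployed and 
freshly generated files from disk rather than from dicts.

```python
import difflib


def diff_configs(deployed, generated):
    """Return {filename: unified diff} for every config that would change.

    Both arguments map a config file name to its full text content.
    (Hypothetical helper for illustration; not part of kolla-ansible.)
    """
    diffs = {}
    for name in sorted(set(deployed) | set(generated)):
        # A file missing on one side is treated as empty on that side.
        old = deployed.get(name, "").splitlines(keepends=True)
        new = generated.get(name, "").splitlines(keepends=True)
        if old != new:
            diffs[name] = "".join(
                difflib.unified_diff(
                    old,
                    new,
                    fromfile=f"deployed/{name}",
                    tofile=f"generated/{name}",
                )
            )
    return diffs


# Made-up sample data: one setting differs between the two versions.
deployed = {"nova.conf": "[DEFAULT]\ndebug = False\n"}
generated = {"nova.conf": "[DEFAULT]\ndebug = True\n"}

changes = diff_configs(deployed, generated)
for name, text in changes.items():
    print(text)
```

An empty result would mean nothing needs to change for those files, which 
also speaks to the point above about operations running when very little 
has changed config wise.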


Database backup & recovery
==========================
* Interesting topic; all agreed that kolla should provide some 
functionality in this area.
* Discussion around which areas of responsibility fall on kolla vs. the 
operator. E.g. 'kolla should allow for regular database backups, how 
those are restored is beyond project scope'

* yankcrime has done some ground work on this as well as a poc.
* Good documentation is important here.

Actions:
* Review yankcrime's poc and provide feedback
* Form a spec detailing what mechanism we want to use to trigger 
backups, etc.


ceph-ansible
============
* All seem in agreement that the issues and work seen in migrating to 
ceph-ansible currently outweigh the benefits.
* Decided to stick with improving kolla ceph for now, with bluestore 
support being a priority.


Actions:
* Write a blueprint to add support for bluestore 
(https://blueprints.launchpad.net/kolla/+spec/kolla-ceph-bluestore)
* Update docs to better inform operators on why they may or may not want 
to use kolla ceph vs the alternatives.


Prometheus support for monitoring
=================================
* There have been some previous attempts to add a monitoring stack in 
Kolla, though none have come to fruition.
* Oracle are looking at prometheus and what it will take to integrate 
it into Kolla to fill this gap.


Actions:
* Write spec to detail how this will work.
* Do the work.

self health check support
=========================
* This had some crossover with the monitoring discussion.
* Kolla has some checks in the form of our 'sanity checks', but these 
are underutilised and not implemented for every service. Tempest or 
rally would be a better fit here.


Actions:
* Remove the sanity check code from kolla-ansible - it's not fit for 
purpose and our assumption is no one is using it.
* Make contact with the self healing SIG, and see if we can help here. 
They may have recommendations for us.

* Make a spec for this.

destroy service & node
======================
* Several aspects to this:
  * We would like to be able to remove an individual service as part of 
kolla-ansible destroy
  * It is not clear what best practice is to remove a control node in Kolla
  * Likewise for compute
  * This could be automated but documentation would go a long way here also.

Actions:
* Clearly document how to remove a control/compute node from a kolla 
deployment.


integrate with docker-compose
=============================
* This is something Jeffrey is working on so we didn't have much to 
contribute in the way of discussion.


Actions:
* Review and provide feedback on https://review.openstack.org/538581

Implement rolling upgrade for all core projects