Re: [openstack-dev] [ClusterLabs Developers] [HA] future of OpenStack OCF resource agents (was: resource-agents v4.2.0)

2018-10-25 Thread Adam Spiers

Hi Tim,

Tim Bell  wrote: 

Adam,

Personally, I would prefer the approach where the OpenStack resource agents are part of the repository in which they are used. 


Thanks for chipping in. 

Just checking - by this you mean the resource-agents rather than 
openstack-resource-agents, right?  Obviously the agents aren't usable 
as standalone components in either context, without a cloud's worth of 
dependencies including Pacemaker. 

This is also the approach taken in other open source projects such as Kubernetes and avoids the inconsistency where, for example, Azure resource agents are in the Cluster Labs repository but OpenStack ones are not. 


Right.  I suspect there's no clearly defined scope for the
resource-agents repository at the moment, so it's probably hard to say
"agent X belongs here but agent Y doesn't".  Although, as has been alluded
to elsewhere in this thread, that in itself could be problematic in
terms of the repository constantly growing.

This can mean that people miss there is OpenStack integration available. 


Yes, discoverability is important, although I think we can make more 
impact on that via better documentation (another area I am struggling 
to make time for ...) 

This does not reflect, in any way, the excellent efforts and results made so far. I don't think it would negate the possibility to include testing in the OpenStack gate since there are other examples where code is pulled in from other sources. 


There are a number of technical barriers, or at the very least
inconveniences, here: because the resource-agents repository is
hosted on GitHub, none of the normal processes based around
Gerrit apply.  I guess it's feasible that since Zuul v3 gained GitHub 
support, it could orchestrate running OpenStack CI on GitHub pull 
requests, although it would have to make sure to only run on PRs which 
affect the OpenStack RAs, and none of the others. 
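
To make that filtering requirement concrete, here is a rough sketch of the
decision logic (the file paths are hypothetical, and real Zuul configuration
would express this declaratively with per-job file matchers rather than
imperative code, so this is illustration only):

    # Sketch only: decide whether a GitHub PR against resource-agents
    # should trigger the OpenStack CI jobs.  The path patterns below are
    # hypothetical; Zuul itself would normally express this via per-job
    # file matchers rather than code like this.
    import fnmatch

    OPENSTACK_RA_PATTERNS = [
        "heartbeat/openstack-*",
        "heartbeat/NovaEvacuate",
    ]

    def should_run_openstack_ci(changed_files):
        """Return True if the PR touches any OpenStack resource agent."""
        return any(fnmatch.fnmatch(path, pattern)
                   for path in changed_files
                   for pattern in OPENSTACK_RA_PATTERNS)

    # e.g. a PR touching only a generic agent would not trigger the jobs:
    #   should_run_openstack_ci(["heartbeat/apache"])          -> False
    #   should_run_openstack_ci(["heartbeat/openstack-info"])  -> True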

Additionally, we'd probably need tags / releases corresponding to each 
OpenStack release, which means polluting a fundamentally 
non-OpenStack-specific repository with OpenStack-specific metadata. 

I think either way we go, there is ugliness.  Personally I'm still 
leaning towards continued use of openstack-resource-agents, but I'm 
happy to go with the majority consensus if we can get a 
semi-respectable number of respondents :-) 




Re: [openstack-dev] [ClusterLabs Developers] [HA] future of OpenStack OCF resource agents (was: resource-agents v4.2.0)

2018-10-24 Thread Adam Spiers

[cross-posting to openstack-dev]

Oyvind Albrigtsen  wrote:

ClusterLabs is happy to announce resource-agents v4.2.0.
Source code is available at:
https://github.com/ClusterLabs/resource-agents/releases/tag/v4.2.0

The most significant enhancements in this release are:
- new resource agents:


[snipped]


- openstack-cinder-volume
- openstack-floating-ip
- openstack-info


That's an interesting development.

By popular demand from the community, in Oct 2015 the canonical
location for OpenStack-specific resource agents became:

   https://git.openstack.org/cgit/openstack/openstack-resource-agents/

as announced here:

   http://lists.openstack.org/pipermail/openstack-dev/2015-October/077601.html

However I have to admit I have done a terrible job of maintaining it
since then.  Since OpenStack RAs are now beginning to creep into
ClusterLabs/resource-agents, now seems a good time to revisit this and
decide a coherent strategy.  I'm not religious either way, although I
do have a fairly strong preference for picking one strategy which both
ClusterLabs and OpenStack communities can align on, so that all
OpenStack RAs are in a single place.

I'll kick the bikeshedding off:

Pros of hosting OpenStack RAs on ClusterLabs
--------------------------------------------

- ClusterLabs developers get the GitHub code review and Travis CI
 experience they expect.

- Receive all the same maintenance attention as other RAs - any
 changes to coding style, utility libraries, Pacemaker APIs,
 refactorings etc. which apply to all RAs would automatically
 get applied to the OpenStack RAs too.

- Documentation gets built in the same way as other RAs.

- Unit tests get run in the same way as other RAs (although does
 ocf-tester even get run by the CI currently?)

- Doesn't get maintained by me ;-)

Pros of hosting OpenStack RAs on OpenStack infrastructure
---------------------------------------------------------

- OpenStack developers get the Gerrit code review and Zuul CI
 experience they expect.

- Releases and stable/foo branches could be made to align with
 OpenStack releases (..., Queens, Rocky, Stein, T(rains?)...)

- Automated testing could in the future spin up a full cloud
 and do integration tests by simulating failure scenarios,
 as discussed here:

 https://storyboard.openstack.org/#!/story/2002129

 That said, that is still very much work in progress, so
 it remains to be seen when that could come to fruition.

No doubt I've missed some pros and cons here.  At this point
personally I'm slightly leaning towards keeping them in the
openstack-resource-agents repository - but that's assuming I can either hand off
maintainership to someone with more time, or somehow find the time
myself to do a better job.

What does everyone else think?  All opinions are very welcome,
obviously.



[openstack-dev] [infra] Gerrit User Summit, November 2018

2018-10-02 Thread Adam Spiers

Hi all,

The forthcoming Gerrit User Summit 2018 will be held on Nov 15th-16th in
Palo Alto, hosted by Cloudera.

See the Gerrit User Summit page at:

   https://gerrit.googlesource.com/summit/2018/+/master/index.md

and the event registration at:

   https://gus2018.eventbrite.com

Hopefully some members of the OpenStack community can attend the
event, not just so we can keep up to date with Gerrit but also so that
our interests can be represented!

Regards,
Adam



Re: [openstack-dev] [Openstack-sigs] [self-healing][heat][vitrage][mistral] Self-Healing with Vitrage, Heat, and Mistral

2018-08-13 Thread Adam Spiers

Hi Rico,

Firstly sorry for the slow reply!  I am finally catching up on my
backlog.

Rico Lin  wrote:

Dear all

Back at the Vancouver Summit, Ifat brought up the idea of integrating Heat,
Vitrage, and Mistral to enable a better self-healing scenario.
As previous work, there is already work across Heat, Mistral, and Zaqar for
self-healing [1], and there is work across Vitrage and Mistral [2].
Now we plan to start integrating the two pieces of work (as much as it
can/should be done), to make sure the scenario works, and to keep it working.
The integrated scenario flow will look something like this:
an existing monitor detects a host/network failure and sends an alarm to
Vitrage -> Vitrage deduces that the instance is down (based on the topology
and on Vitrage templates [2]) -> Vitrage triggers Mistral to fix the
instance -> the application is recovered.
We created an etherpad [3] to document all discussion/feedback/plans (and
will add more detail over time).
We have also created a story in the self-healing SIG to track all the tasks [4].

The current plans are:

  - A spec for Vitrage resources in Heat [5]
  - Create Vitrage resources in Heat
  - Write Heat Template and Vitrage Template for this scenario
  - A tempest task for the above scenario
  - Add a periodic job for this scenario (with the above task). The best
  place to host this job (IMO) is under the self-healing SIG


This is great!  It's a perfect example of the kind of cross-project
collaboration which I always hoped the SIG would host.  And I really
love the idea of Heat making it even easier to deploy Vitrage
templates automatically.
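
For anyone trying to visualise the flow described above, here is a purely
conceptual Python sketch of the chain of events; every name in it is
hypothetical, since the real integration is expressed declaratively through
Vitrage templates and Mistral workflows rather than code:

    # Conceptual sketch of the monitor -> Vitrage -> Mistral chain; all
    # functions are hypothetical stand-ins for declarative templates.
    def monitor_detects_failure(host):
        """An external monitor raises an alarm about a failed host."""
        return {"type": "host.down", "host": host}

    def vitrage_deduce(alarm, topology):
        """Vitrage maps the host alarm onto affected instances."""
        affected = topology.get(alarm["host"], [])
        return [{"type": "instance.down", "instance": i} for i in affected]

    def mistral_trigger_recovery(deduced):
        """Vitrage triggers a Mistral workflow to recover the instance."""
        print("running recovery workflow for", deduced["instance"])

    topology = {"compute-0": ["vm-1", "vm-2"]}      # hypothetical topology
    for deduced in vitrage_deduce(monitor_detects_failure("compute-0"),
                                  topology):
        mistral_trigger_recovery(deduced)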

Originally I thought that this would be too hard and that the SIG
would initially need to focus on documenting how to manually deploy
self-healing configurations, but supporting automation early on is a
very nice bonus :-)  So I expect that implementing this can make lives
a lot easier for operators (and users) who need self-healing :-)

And yes, I agree that the SIG would be the best place to host this
job.


Creating a periodic job for the self-healing SIG means we might also need a
place to manage those self-healing tempest tests. For this scenario, I think
it would make sense to use heat-tempest-plugin to store that scenario
test (since it will be wrapped as a Heat template) or to use
vitrage-tempest-plugin (since most of the test scenario is actually already there).


Sounds good.


Not sure what will happen if we create a new tempest plugin for
self-healing and no manager for it.


Sorry for my ignorance - do you mean manager objects here[0], or some
other kind of manager?

[0] https://docs.openstack.org/tempest/latest/write_tests.html#manager-objects
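
(For anyone else reading along, here is a minimal sketch of what I mean by
manager objects in a tempest plugin test; the class and test names are made
up for illustration, and only the BaseTestCase / credentials / setup_clients
pattern comes from tempest itself.)

    # Sketch of a tempest plugin test using "manager objects".
    from tempest import test

    class SelfHealingSmokeTest(test.BaseTestCase):

        # Each requested credential set yields a manager object,
        # e.g. 'primary' -> cls.os_primary
        credentials = ['primary']

        @classmethod
        def setup_clients(cls):
            super(SelfHealingSmokeTest, cls).setup_clients()
            # Individual service clients hang off the manager object
            cls.servers_client = cls.os_primary.servers_client

        def test_compute_api_responds(self):
            # Hypothetical sanity check: the compute API answers at all
            body = self.servers_client.list_servers()
            self.assertIn('servers', body)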


We still have some uncertainty to clear up while working on it, but the big
picture looks like it will all work (if we do all of the above tasks well).
Please provide your feedback or questions if you have any.
We do need feedback and reviews on patches and any other work.
If you're interested in this, please join us (we need users/ops/devs!).

[1] https://github.com/openstack/heat-templates/tree/master/hot/autohealing
[2]
https://github.com/openstack/self-healing-sig/blob/master/specs/vitrage-mistral-integration.rst
[3] https://etherpad.openstack.org/p/self-healing-with-vitrage-mistral-heat
[4] https://storyboard.openstack.org/#!/story/2002684
[5] https://review.openstack.org/#/c/578786


Thanks a lot for creating the story in Storyboard - this is really
helpful :-)

I'll try to help with reviews etc. and maybe even testing if I can
find some extra time for it over the next few months.  I can also try
to help "market" this initiative in the community by promoting
awareness and trying to get operators more involved.

Thanks again!  Excited about the direction this is heading in :-)

Adam



Re: [openstack-dev] PTG Denver Horns

2018-08-08 Thread Adam Spiers

Matthew Thode  wrote:

On 18-08-07 23:18:26, David Medberry wrote:

Requests have finally been made (today, August 7, 2018) to end the horns on
the train from Denver to Denver International airport (within the city
limits of Denver.) Prior approval had been given to remove the FLAGGERS
that were stationed at each crossing intersection.

Of particular note (at the bottom of the article):

There’s no estimate for how long it could take the FRA to approve quiet
zones.

ref:
https://www.9news.com/article/news/local/next/denver-officially-asks-fra-for-permission-to-quiet-a-line-horns/73-581499094

I'd recommend bringing your sleeping aids, ear plugs, etc, just in case not
approved by next month's PTG. (The Renaissance is within Denver proper as
near as I can tell so that nearby intersection should be covered by this
ruling/decision if and when it comes down.)


Thanks for the update.  If you are up to it, keeping us informed on this
would be nice, if only for the hilarity.


Thanks indeed for the warning.

If the approval doesn't go through, we may need to resume the design
work started last year; see lines 187 onwards of

   https://etherpad.openstack.org/p/queens-PTG-feedback



Re: [openstack-dev] [sig][upgrades][ansible][charms][tripleo][kolla][airship] reboot or poweroff?

2018-08-03 Thread Adam Spiers

[Adding openstack-sigs list too; apologies for the extreme
cross-posting, but I think in this case the discussion deserves wide
visibility.  Happy to be corrected if there's a better way to handle
this.]

Hi James,

James Page  wrote:

Hi All

tl;dr we (the original founders) have not managed to invest the time to get
the Upgrades SIG booted - time to hit reboot or time to poweroff?


TL;DR response: reboot, absolutely no question!  My full response is
below.


Since Vancouver, two of the original SIG chairs have stepped down leaving
me in the hot seat with minimal participation from either deployment
projects or operators in the IRC meetings.  In addition I've only been able
to make every 3rd IRC meeting, so they have generally not been happening.

I think the current timing is not good for a lot of folk so finding a
better slot is probably a must-have if the SIG is going to continue - and
maybe moving to a monthly or bi-weekly schedule rather than the weekly slot
we have now.

In addition I need some willing folk to help with leadership in the SIG.
If you have an interest and would like to help please let me know!

I'd also like to better engage with all deployment projects - upgrades is
something that deployment tools should be looking to encapsulate as
features, so it would be good to get deployment projects engaged in the SIG
with nominated representatives.

Based on the attendance in upgrades sessions in Vancouver and
developer/operator appetite to discuss all things upgrade at said sessions
I'm assuming that there is still interest in having a SIG for Upgrades but
I may be wrong!

Thoughts?


As a SIG leader in a similar position (albeit with one other very
helpful person on board), let me throw my £0.02 in ...

With both upgrades and self-healing I think there is a big disparity
between supply (developers with time to work on the functionality) and
demand (operators who need the functionality).  And perhaps also the
high demand leads to a lot of developers being interested in the topic
whilst not having much spare time to help out.  That is probably why
we both see high attendance at the summit / PTG events but relatively
little activity in between.

I also freely admit that the inevitable conflicts with downstream
requirements mean that I have struggled to find time to be as
proactive with driving momentum as I had wanted, although I'm hoping
to pick this up again over the next weeks leading up to the PTG.  It
sounds like maybe you have encountered similar challenges.

That said, I strongly believe that both of these SIGs offer a *lot* of
value, and even if we aren't yet seeing the level of online activity
that we would like, I think it's really important that they both
continue.  If for no other reasons, the offline sessions at the
summits and PTGs are hugely useful for helping converge the community
on common approaches, and the associated repositories / wikis serve as
a great focal point too.

Regarding online collaboration, yes, building momentum for IRC
meetings is tough, especially with the timezone challenges.  Maybe a
monthly cadence is a reasonable starting point, or twice a month in
alternating timezones - but maybe with both meetings within ~24 hours
of each other, to reduce accidental creation of geographic silos.

Another possibility would be to offer "open clinic" office hours, like
the TC and other projects have done.  If the TC or anyone else has
established best practices in this space, it'd be great to hear them.

Either way, I sincerely hope that you decide to continue with the SIG,
and that other people step up to help out.  These things don't develop
overnight but it is a tremendously worthwhile initiative; after all,
everyone needs to upgrade OpenStack.  Keep the faith! ;-)

Cheers,
Adam



Re: [openstack-dev] [ptg] Self-healing SIG meeting moved to Thursday morning

2018-07-31 Thread Adam Spiers

Thierry Carrez  wrote:

Hi! Quick heads-up:

Following a request[1] from Adam Spiers (SIG lead), we modified the 
PTG schedule to move the Self-Healing SIG meeting from Friday (all 
day) to Thursday morning (only morning). You can see the resulting 
schedule at:


https://www.openstack.org/ptg#tab_schedule

Sorry for any inconvenience this may cause.


It's me who should be apologising - Thierry only deserves thanks for
accommodating my request at late notice ;-)



Re: [openstack-dev] [self-healing] [ptg] [monasca] PTG track schedule published

2018-07-30 Thread Adam Spiers

Hi Witek,

Thanks a lot for the offer!  I've suggested to Thierry that Thursday 
morning probably works best, but if the room logistics don't permit 
that then we might have to accept your kind offer - I'll let you know. 


Cheers!
Adam

Bedyk, Witold  wrote: 

Hi Adam,

if nothing else works, we could probably offer you a half-day of the Monasca slot on Monday or Tuesday afternoon. I'm afraid, though, that our room might be too small for you. 


Cheers
Witek


-----Original Message-----
From: Thierry Carrez 
Sent: Friday, 20 July 2018 18:46
To: Adam Spiers 
Cc: openstack-dev mailing list 
Subject: Re: [openstack-dev] [self-healing] [ptg] PTG track schedule 
published 

Adam Spiers wrote: 
Apologies - I have had to change plans and leave on the Thursday 
evening (old friend is getting married on Saturday morning).  Is there 
any chance of swapping the self-healing slot with one of the others? 


It's tricky, as you asked to avoid conflicts with API SIG, Watcher, Monasca, 
Masakari, and Mistral... Which day would be best for you given the current 
schedule (assuming we don't move anything else, as it's too late for that)? 


--
Thierry Carrez (ttx) 




Re: [openstack-dev] [self-healing] [ptg] PTG track schedule published

2018-07-20 Thread Adam Spiers

Thierry Carrez  wrote:

Thierry Carrez wrote:

Hi everyone,

Last month we published the tentative schedule layout for the 5 days 
of PTG. There was no major complaint, so that was confirmed as the 
PTG event schedule and published on the PTG website:


https://www.openstack.org/ptg#tab_schedule


The tab temporarily disappeared; while it is being restored, you can 
access the schedule at:


https://docs.google.com/spreadsheets/d/e/2PACX-1vRM2UIbpnL3PumLjRaso_9qpOfnyV9VrPqGbTXiMVNbVgjiR3SIdl8VSBefk339MhrbJO5RficKt2Rr/pubhtml?gid=1156322660=true


Apologies - I have had to change plans and leave on the Thursday
evening (old friend is getting married on Saturday morning).  Is there
any chance of swapping the self-healing slot with one of the others?

Sorry for having to ask!
Adam



Re: [openstack-dev] [tc] campaign question: How can we make contributing to OpenStack easier?

2018-04-26 Thread Adam Spiers

Doug Hellmann  wrote:

Excerpts from Adam Spiers's message of 2018-04-25 18:15:42 +0100:

[BTW I hope it's not considered off-bounds for those of us who aren't
TC election candidates to reply within these campaign question threads
to responses from the candidates - but if so, let me know and I'll
shut up ;-) ]


Everyone should feel free to participate!


Jeremy Stanley  wrote:

Not only are responses from everyone in the community welcome (and
like many, I think we should be asking questions like this often
outside the context of election campaigning), but I wholeheartedly
agree with your stance on this topic and also strongly encourage you
to consider running for a seat on the TC in the future if you can
swing it.


Thanks both for your support!



Re: [openstack-dev] [tc] campaign question: How can we make contributing to OpenStack easier?

2018-04-25 Thread Adam Spiers

[BTW I hope it's not considered off-bounds for those of us who aren't
TC election candidates to reply within these campaign question threads
to responses from the candidates - but if so, let me know and I'll
shut up ;-) ]

Zhipeng Huang  wrote:

Culture-wise, being too IRC-centric is definitely not helping; that is my own
experience of getting new Cyborg developers in China to join our weekly
meeting. Well, we could always argue it is part of the open source/hacker
culture and preferable to commercial solutions that carry the constant risk
of suddenly being shut down someday. But as OpenStack becomes more
commercialized and widely adopted, we should be aware that more and more
(potential) contributors will come from groups who are used to environments
that are not strictly open source, such as product development teams which
rely on a lot of "closed source" but easy-to-use software.

The change?  Use more video conferencing, and more of the commercial tools
preferred in certain regions. Stop being allergic to non-open-source
software and bring more capable, but less hacker-culture-inclined,
contributors into the community.


I respectfully disagree :-)


I know this is not a super-welcome stance in the open source hacker
culture. But if we want OpenStack to be able to sustain more developers and
not have a mid-life crisis and then become a fringe project, we need to
start changing the hacker mindset.


I think that "the hacker mindset" is too ambiguous and generalized a
concept to be useful in framing justification for change.  From where
I'm standing, the hacker mindset is a wonderful and valuable thing
which should be preserved.

However, if that conflicts with other goals of our community, such as
reducing barrier to entry, then yes that is a valid concern.  In that
case we should examine in more detail the specific aspects of hacker
culture which are discouraging potential new contributors, and try to
fix those, rather than jumping to the assumption that we should
instead switch to commercial tools.  Given the community's "Four
Opens" philosophy and strong belief in the power of Open Source, it
would be inconsistent to abandon our preference for Open Source tools.

For example, proprietary tools such as Slack are not popular because
they are proprietary; they are popular because they have a very
intuitive interface and convenient features which people enjoy.  So
when examining the specific question "What can we do to make it easier
for OpenStack newbies to communicate with the existing community over
a public instant messaging system?", the first question should not be
"Should we switch to a proprietary tool?", but rather "Is there an
open source tool which provides enough of the functionality we need?"

And in fact in the case of instant messaging, I believe the answer is
yes, as I previously pointed out:

   http://lists.openstack.org/pipermail/openstack-sigs/2018-March/000332.html

Similarly, there are plenty of great Open Source solutions for voice
and video communications.

I'm all for changing with the times and adapting workflows to harness
the benefits of more modern tools, but I think it's wrong to
automatically assume that this can only be achieved via proprietary
solutions.



Re: [openstack-dev] [TripleO][CI][QA][HA][Eris][LCOO] Validating HA on upstream

2018-03-15 Thread Adam Spiers

Raoul Scarazzini  wrote:

On 15/03/2018 01:57, Ghanshyam Mann wrote:

Thanks all for starting the collaboration on this; it has been pending for a
long time and we all want to make a start on it.
SamP and I talked about it during the Ops meetup in Tokyo, and we discussed
the draft plan below:
- Update the spec - https://review.openstack.org/#/c/443504/ - which is
almost ready according to SamP, and his team is working on it.
- Start the technical debate on the tooling we can use/reuse, like Yardstick
etc., which is mostly what this mailing thread is about.
- Accept the new repo for Eris under QA and start at least something in the
Rocky cycle.
I am in favour of having a meeting on this, which is a really good idea.  A
non-IRC meeting is totally fine here.  Do we have a meeting place and time
set up?
-gmann


Hi Ghanshyam,
as I wrote earlier in the thread, it's no problem for me to offer my
BlueJeans channel; let's sort out which time slot would work. I've
added my timezone to the main etherpad [1] (line 53); let's do all that
so that we can create the meeting invite.

[1] https://etherpad.openstack.org/p/extreme-testing-contacts


Good idea!  I've added mine.  We're still missing replies from several
key stakeholders though (lines 62++) - probably worth getting buy-in
from a few more people before we organise anything.  I'm pinging a few
on IRC with reminders about this.



[openstack-dev] [self-healing] Dublin PTG summary, and request for feedback

2018-03-14 Thread Adam Spiers

Hi all,

I just posted a summary of the Self-healing SIG session at the Dublin
PTG:

  http://lists.openstack.org/pipermail/openstack-sigs/2018-March/000317.html

If you are interested in the topic of self-healing within OpenStack,
you are warmly invited to subscribe to the openstack-sigs mailing
list:

  http://lists.openstack.org/pipermail/openstack-sigs/

and/or join the #openstack-self-healing channel on Freenode IRC.

We are actively gathering feedback to help steer the SIG's focus in
the right direction, so all thoughts are very welcome, especially from
operators, since the primary goal of the SIG is to make life easier
for operators.

I have also just created an etherpad for brainstorming topics for the
Forum in Vancouver:

  https://etherpad.openstack.org/p/YVR-self-healing-brainstorming

Feel free to put braindumps in there :-)

Thanks!
Adam



Re: [openstack-dev] [TripleO][CI][QA][HA][Eris][LCOO] Validating HA on upstream

2018-03-08 Thread Adam Spiers

Raoul Scarazzini <ra...@redhat.com> wrote:

On 08/03/2018 17:03, Adam Spiers wrote:
[...]

Yes agreed again, this is a strong case for collaboration between the
self-healing and QA SIGs.  In Dublin we also discussed the idea of the
self-healing and API SIGs collaborating on the related topic of health
check APIs.


Guys, thanks a ton for your involvement in the topic. I am +1 on any
kind of meeting we can have to discuss this (as proposed by
Adam), so I'll offer my BlueJeans channel for whatever kind of meeting we
want to organize.


Awesome, thanks - bluejeans would be great.


About the best practices part Georg was mentioning, I'm in 100%
agreement: the testing methodologies are the first thing we need to care
about, starting from what we want to achieve.
That said, I'll keep studying Yardstick.

Hope to hear from you soon, and thanks again!


Yep - let's wait for people to catch up with the thread and hopefully
we'll get enough volunteers on

 https://etherpad.openstack.org/p/extreme-testing-contacts

for critical mass and then we can start discussing!  I think it's
especially important that we have the Eris folks on board since they
have already been working on this for a while.



Re: [openstack-dev] [TripleO][CI][QA][HA][Eris][LCOO] Validating HA on upstream

2018-03-08 Thread Adam Spiers
Georg Kunz  wrote: 

Hi Adam,

Raoul Scarazzini  wrote: 
In the meantime, I'll check yardstick to see which kind of bridge we 
can build to avoid reinventing the wheel. 


Great, thanks!  I wish I could immediately help with this, but I haven't had the 
chance to learn yardstick myself yet.  We should probably try to recruit 
someone from OPNFV to provide advice.  I've cc'd Georg who IIRC was the 
person who originally told me about yardstick :-)  He is an NFV expert and is 
also very interested in automated testing efforts: 

http://lists.openstack.org/pipermail/openstack-dev/2017-November/124942.html 

so he may be able to help with this architectural challenge. 


Thank you for bringing this up here. Better collaboration and sharing of knowledge, methodologies and tools across the communities is really what I'd like to see and facilitate. Hence, I am happy to help. 

I have already started to advertise the newly proposed QA SIG in the OPNFV test WG and I'll happily do the same for the self-healing SIG and any HA testing efforts in general. There is certainly some overlapping interest in these testing aspects between the QA SIG and the self-healing SIG and hence collaboration between both SIGs is crucial. 


That's fantastic - thank you so much! 

One remark regarding tools and frameworks: I consider the true value of a SIG to be a place for talking about methodologies and best practices: What do we need to test? What are the challenges? How can we approach this across communities? The tools and frameworks are important and we should investigate which tools are available, how good they are, how much they fit a given purpose, but at the end of the day they are tools meant to enable well designed testing methodologies. 


Agreed 100%. 


[snipped]

I'm beginning to think that maybe we should organise a video conference call 
to coordinate efforts between the various interested parties.  If there is 
appetite for that, the first question is: who wants to be involved?  To answer 
that, I have created an etherpad where interested people can sign up: 

https://etherpad.openstack.org/p/extreme-testing-contacts 

and I've cc'd people who I think would probably be interested.  Does this 
sound like a good approach? 


We discussed a very similar idea in Dublin in the context of the QA SIG. I very much like the idea of a cross-community, cross-team, and apparently even cross-SIG approach. 


Yes agreed again, this is a strong case for collaboration between the 
self-healing and QA SIGs.  In Dublin we also discussed the idea of the 
self-healing and API SIGs collaborating on the related topic of health 
check APIs. 




Re: [openstack-dev] [kolla] PTG Summary

2018-03-08 Thread Adam Spiers

Paul Bourke  wrote:

Hi all,

Here's my summary of the various topics we discussed during the PTG. 
There were one or two I had to step out for but hopefully this serves 
as an overall recap. Please refer to the main etherpad[0] for more 
details and links to the session specific pads.


[snipped]


self health check support
=========================
* This had some crossover with the monitoring discussion.
* Kolla has some checks in the form of our 'sanity checks', but these 
are underutilised and not implemented for every service. Tempest or 
rally would be a better fit here.


Actions:
* Remove the sanity check code from kolla-ansible - it's not fit for 
purpose and our assumption is noone is using it.
* Make contact with the self healing SIG, and see if we can help here. 
They may have recommendations for us.

* Make a spec for this.


[snipped]

Would be great to collaborate!  As the SIG is still new we don't have
regular meetings set up yet, but please join #openstack-self-healing
on IRC, and you can mail the openstack-sigs list with [self-healing]
in the subject.


Implement rolling upgrade for all core projects
===
* Started by defining the 'terms of engagement', i.e. what do we mean 
by rolling upgrade in kolla, what we currently have vs. what projects 
support, etc.
* There are two efforts under way here, 1) supporting online upgrade 
for all core projects that support it, 2) supporting FFU(offline) 
upgrade in Kolla.

* lujinluo is working on a way to do online FFU in Kolla.
* Testing - we need gates to test upgrade.

Actions:
* Finish implementation of rolling upgrade for all projects that 
support it in Rocky

* Improve documentation around this and upgrades in general for Kolla
* Spec in Rocky for FFU and associated efforts
* Begin looking at what would be required for upgrade gates in Kolla


Yes, a spec or other docs nailing down exactly what is meant by
rolling upgrade and FFU upgrade would be a great help.  I was in the
FFU session in Dublin and it felt to me like not everyone was on the
same page yet regarding the precise definitions, making it difficult
for all projects to move forward together in a coherent fashion.



Re: [openstack-dev] [TripleO][CI][QA][HA][Eris][LCOO] Validating HA on upstream

2018-03-07 Thread Adam Spiers

Raoul Scarazzini <ra...@redhat.com> wrote:

On 06/03/2018 13:27, Adam Spiers wrote:

Hi Raoul and all,
Sorry for joining this discussion late!

[...]

I do not work on TripleO, but I'm part of the wider OpenStack
sub-communities which focus on HA[0] and more recently,
self-healing[1].  With that hat on, I'd like to suggest that maybe
it's possible to collaborate on this in a manner which is agnostic to
the deployment mechanism.  There is an open spec on this:
   https://review.openstack.org/#/c/443504/
which was mentioned in the Denver PTG session on destructive testing
which you referenced[2].

[...]

   https://www.opnfv.org/community/projects/yardstick

[...]

Currently each sub-community and vendor seems to be reinventing HA
testing by itself to some extent, which is easier to accomplish in the
short-term, but obviously less efficient in the long-term.  It would
be awesome if we could break these silos down and join efforts! :-)


Hi Adam,
First of all thanks for your detailed answer. Then let me be honest
while saying that I didn't know yardstick.


Neither did I until Sydney, despite being involved with OpenStack HA
for many years ;-)  I think this shows that either a) there is room
for improved communication between the OpenStack and OPNFV
communities, or b) I need to take my head out of the sand more often ;-)


I need to start from scratch
here to understand what this project is. In any case, the whole point
of this thread is to involve people and have a more comprehensive look
at what's around.
The point here is that, as you can see from the tripleo-ha-utils spec
[1] I've created, the project is meant for TripleO specifically. On one
side this is a significant limitation, but on the other, due to the
pluggable nature of the project, I think that integration with other
software like you are proposing is not impossible.


Yep.  I totally sympathise with the tension between the need to get
something working quickly, vs. the need to collaborate with the
community in the most efficient way.


Feel free to add your comments to the review.


The spec looks great to me; I don't really have anything to add, and I
don't feel comfortable voting in a project which I know very little
about.


In the meantime, I'll check yardstick to see which kind of bridge we
can build to avoid reinventing the wheel.


Great, thanks!  I wish I could immediately help with this, but I
haven't had the chance to learn yardstick myself yet.  We should
probably try to recruit someone from OPNFV to provide advice.  I've
cc'd Georg who IIRC was the person who originally told me about
yardstick :-)  He is an NFV expert and is also very interested in
automated testing efforts:

   http://lists.openstack.org/pipermail/openstack-dev/2017-November/124942.html

so he may be able to help with this architectural challenge.

Also you should be aware that work has already started on Eris, the
extreme testing framework proposed in this user story:

   
http://specs.openstack.org/openstack/openstack-user-stories/user-stories/proposed/openstack_extreme_testing.html

and in the spec you already saw:

   https://review.openstack.org/#/c/443504/

You can see ongoing work here:

   https://github.com/LCOO/eris
   
https://openstack-lcoo.atlassian.net/wiki/spaces/LCOO/pages/13393034/Eris+-+Extreme+Testing+Framework+for+OpenStack

It looks like there is a plan to propose a new SIG for this, although
personally I would be very happy to see it adopted by the self-healing
SIG, since this framework is exactly what is needed for testing any
self-healing mechanism.

I'm hoping that Sampath and/or Gautum will chip in here, since I think
they're currently the main drivers for Eris.

I'm beginning to think that maybe we should organise a video
conference call to coordinate efforts between the various interested
parties.  If there is appetite for that, the first question is: who
wants to be involved?  To answer that, I have created an etherpad
where interested people can sign up:

   https://etherpad.openstack.org/p/extreme-testing-contacts

and I've cc'd people who I think would probably be interested.  Does
this sound like a good approach?

Cheers,
Adam



Re: [openstack-dev] [TripleO][CI][QA][HA] Validating HA on upstream

2018-03-06 Thread Adam Spiers

Hi Raoul and all,

Sorry for joining this discussion late!

Raoul Scarazzini  wrote:

TL;DR: we would like to change the way HA is tested upstream to avoid
being hit by avoidable bugs that the CI process should discover.

Long version:

Today, upstream HA testing consists only of verifying that a three-controller
setup comes up correctly and can spawn an instance. That's
something, but it's far from being enough, since we continuously see "day
two" bugs.
We started covering this more than a year ago in internal CI and today
also on rdocloud, using a project named tripleo-quickstart-utils [1].
Despite its name, the project is not limited to tripleo-quickstart;
it covers three principal roles:

1 - stonith-config: a playbook that can be used to automate the creation
of fencing devices in the overcloud;
2 - instance-ha: a playbook that automates the seventeen manual steps
needed to configure instance HA in the overcloud, tests them via rally
and verifies that instance HA works;
3 - validate-ha: a playbook that runs a series of disruptive actions in
the overcloud and verifies that it always behaves correctly, by deploying a
heat template that involves all the overcloud components;


Yes, a more rigorous approach to HA testing obviously has huge value,
not just for TripleO deployments, but also for any type of OpenStack
deployment.
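
To make "disruptive actions" concrete for anyone who hasn't seen such tests,
a day-two HA check essentially boils down to: break something, then verify
the control plane keeps answering.  Here is a minimal conceptual sketch
(hostnames, service names and the URL are hypothetical, and this is not how
tripleo-quickstart-utils or Yardstick actually implement it):

    # Conceptual sketch of a "day two" disruptive HA check: stop a service
    # on one controller, then verify the API recovers within a deadline.
    import subprocess
    import time
    import urllib.request

    def kill_service(host, service):
        """Simulate a failure by stopping a service on one controller."""
        subprocess.run(["ssh", host, "sudo", "systemctl", "stop", service],
                       check=True)

    def api_healthy(url):
        try:
            return urllib.request.urlopen(url, timeout=5).status == 200
        except Exception:
            return False

    def run_disruption_test(host, service, api_url, deadline=300):
        kill_service(host, service)
        start = time.time()
        while time.time() - start < deadline:
            if api_healthy(api_url):
                return True       # the cluster absorbed the failure
            time.sleep(10)
        return False              # HA did not recover in time

    # e.g. run_disruption_test("controller-0", "openstack-nova-api",
    #                          "http://overcloud.example.com:8774/")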


To make this usable upstream, we need to understand where to put this
code. Here are some choices:


[snipped]

I do not work on TripleO, but I'm part of the wider OpenStack
sub-communities which focus on HA[0] and more recently,
self-healing[1].  With that hat on, I'd like to suggest that maybe
it's possible to collaborate on this in a manner which is agnostic to
the deployment mechanism.  There is an open spec on this:

   https://review.openstack.org/#/c/443504/

which was mentioned in the Denver PTG session on destructive testing
which you referenced[2].

As mentioned in the self-healing SIG's session in Dublin[3], the OPNFV
community has already put a lot of effort into testing HA scenarios,
and it would be great if this work was shared across the whole
OpenStack community.  In particular they have a project called
Yardstick:

   https://www.opnfv.org/community/projects/yardstick

which contains a bunch of HA test cases:

   
http://docs.opnfv.org/en/latest/submodules/yardstick/docs/testing/user/userguide/15-list-of-tcs.html#h-a

Currently each sub-community and vendor seems to be reinventing HA
testing by itself to some extent, which is easier to accomplish in the
short-term, but obviously less efficient in the long-term.  It would
be awesome if we could break these silos down and join efforts! :-)

Cheers,
Adam

[0] #openstack-ha on Freenode IRC
[1] https://wiki.openstack.org/wiki/Self-healing_SIG
[2] https://etherpad.openstack.org/p/qa-queens-ptg-destructive-testing
[3] https://etherpad.openstack.org/p/self-healing-ptg-rocky



Re: [openstack-dev] [masakari] Any masakari folks at the PTG this week ?

2018-02-28 Thread Adam Spiers

My claim to being a masakari person is pretty weak, but still I'd like
to say hello too :-)  Please ping me (aspiers on IRC) if you guys are
meeting up!

Bhor, Dinesh  wrote:

Hi Greg,


We below are present:


Tushar Patil(tpatil)

Yukinori Sagara(sagara)

Abhishek Kekane(abhishekk)

Dinesh Bhor(Dinesh_Bhor)


Thank you,

Dinesh Bhor



From: Waines, Greg 
Sent: 28 February 2018 19:22:26
To: OpenStack Development Mailing List (not for usage questions)
Subject: [openstack-dev] [masakari] Any masakari folks at the PTG this week ?


Any masakari folks at the PTG this week ?



Would be interested in meeting up and chatting,

let me know,

Greg.

__








[openstack-dev] [self-healing][PTG] etherpad for PTG session on self-healing

2018-02-22 Thread Adam Spiers

Hi all,

Yushiro kindly created an etherpad for the self-healing SIG session at
the Dublin PTG on Tuesday afternoon next week, and I've fleshed it out
a bit:

   https://etherpad.openstack.org/p/self-healing-ptg-rocky

Anyone with an interest in self-healing is of course very welcome to
attend (or keep an eye on it remotely!)  This SIG is still very young,
so it's a great chance for you to shape the direction it takes :-)  If
you are able to attend, please add your name, and also feel free to
add topics which you would like to see covered.

It would be particularly helpful if operators could participate and
share their experiences of what is or isn't (yet!) working with
self-healing in OpenStack, so that those of us on the development side
can aim to solve the right problems :-)

Thanks, and see some of you in Dublin!

Adam



Re: [openstack-dev] Etherpad for self-healing

2018-02-22 Thread Adam Spiers

Furukawa, Yushiro  wrote:

Hi everyone,

I am seeing Self-healing scheduled on Tuesday afternoon[1], but the etherpad 
for it is not listed in [2].
I went ahead and made the following etherpad.


Thanks!  You beat me to it ;-)


Would it be possible to update the Etherpads wiki page?


Done.


https://etherpad.openstack.org/p/self-healing-ptg-rocky


I'm also adding some more ideas for topics to the etherpad, and then
I'll (re-)announce it here and also to openstack-{operators,sigs} to
promote visibility.



[openstack-dev] Remembering Shawn Pearce (fwd)

2018-02-02 Thread Adam Spiers

Dear Stackers,

Since git and Gerrit are at the heart of our development process, I am
passing on this very sad news from the git / Gerrit communities that
Shawn Pearce has passed away after an aggressive lung cancer.

Shawn was founder of Gerrit / JGit / libgit2 / git-gui, and the third
most prolific contributor to git itself.

   https://gitenterprise.me/2018/01/30/shawn-pearce-a-true-leader/
   https://sfconservancy.org/blog/2018/jan/30/shawn-pearce/
   https://twitter.com/cdibona/status/957822400518696960
   
https://public-inbox.org/git/CAP8UFD0aKqT5YXJx9-MqeKCKhOVGxninRf8tv30=hkgvmhg...@mail.gmail.com/T/#mf5c158c68565c1c68c80b6543966ef2cad6d151c
   https://groups.google.com/forum/#!topic/repo-discuss/B4P7G1YirdM/discussion

He is survived by his wife and two young sons.  A memorial fund has
been set up in aid of the boys' education and future:

   https://gitenterprise.me/2018/01/30/gerrithub-io-donations-to-shawns-family/

Thank you Shawn for enriching our lives with your great contributions
to the FLOSS community.

- Forwarded message from Adam Spiers <aspi...@suse.com> -

Date: Fri, 2 Feb 2018 15:12:35 +
From: Adam Spiers <aspi...@suse.com>
To: Luca Milanesio <l...@gerritforge.com>
Subject: Re: Fwd: Remembering Shawn Pearce

Hi Luca, that's such sad news :-(  What an incredible contribution
Shawn made to the community.  In addition to Gerrit, I use git-gui and
gitk regularly, and also my git-deps utility is based on libgit2.  I
had no idea he wrote them all, and many other things.

I will certainly donate and also ensure that the OpenStack community
is aware of the memorial fund.  Thanks a lot for letting me know!

Luca Milanesio <l...@gerritforge.com> wrote:

Hi Adam,
you probably have received this very sad news :-(
As GerritForge we are actively supporting, contributing and promoting the donations 
to Shawn's Memorial Fund (https://www.gofundme.com/shawn-pearce-memorial-fund) and 
added a donation button to GerritHub.io <http://gerrithub.io/>.

Feel free to spread the sad news to the OpenStack community you are in touch 
with.
---
Luca Milanesio
GerritForge
3rd Fl. 207 Regent Street
London W1B 3HH - UK
http://www.gerritforge.com <http://www.gerritforge.com/>

l...@gerritforge.com <mailto:l...@gerritforge.com>
Tel:  +44 (0)20 3292 0677
Mob: +44 (0)792 861 7383
Skype: lucamilanesio
http://www.linkedin.com/in/lucamilanesio 
<http://www.linkedin.com/in/lucamilanesio>

> Begin forwarded message:
> 
> From: "'Dave Borowitz' via Repo and Gerrit Discussion" <repo-disc...@googlegroups.com>

> Subject: Remembering Shawn Pearce
> Date: 29 January 2018 at 15:15:05 GMT
> To: repo-discuss <repo-disc...@googlegroups.com>
> Reply-To: Dave Borowitz <dborow...@google.com>
> 
> Dear Gerrit community,
> 
> I am very saddened to report that Shawn Pearce, long-time Git contributor and founder of the Gerrit Code Review project, passed away over the weekend after being diagnosed with lung cancer last year. He spent his final days comfortably in his home, surrounded by family, friends, and colleagues.
> 
> Shawn was an exceptional software engineer and it is impossible to overstate his contributions to the Git ecosystem. He had everything from the driving high-level vision to the coding skills to solve any complex problem and bring his vision to reality. If you had the pleasure of collaborating with him on code reviews, as I know many of you did, you've seen first-hand his dedication and commitment to quality. You can read more about his contributions in this recent interview <https://git.github.io/rev_news/2017/08/16/edition-30/#developer-spotlight-shawn-pearce>.
> 
> In addition to his technical contributions, Shawn truly loved the open-source communities he was a part of, and the Gerrit community in particular. Growing the Gerrit project from nothing to a global community with hundreds of contributors used by some of the world's most prominent tech companies is something he was extremely proud of.
> 
> Please join me in remembering Shawn Pearce and continuing his legacy. Feel free to use this thread to share your memories with the community Shawn loved.
> 
> If you are interested, his family has set up GoFundMe page <https://www.gofundme.com/shawn-pearce-memorial-fund> to put towards his children's future.
> 
> Best wishes,

> Dave Borowitz
> 
> 

[openstack-dev] ANNOUNCE: Self-healing SIG officially formed (fwd)

2017-11-27 Thread Adam Spiers

As per below, I'm happy to announce that the Self-healing SIG is now
officially formed.  For now, all discussions will happen on
<openstack-s...@lists.openstack.org>, so please subscribe to that list
if you are interested in this topic!

Cheers,
Adam

- Forwarded message from Adam Spiers <aspi...@suse.com> -

Date: Mon, 27 Nov 2017 14:19:25 +0000
From: Adam Spiers <aspi...@suse.com>
To: OpenStack SIGs list <openstack-s...@lists.openstack.org>
Subject: [Openstack-sigs] [meta] [self-healing] ANNOUNCE: Self-healing SIG 
officially formed
Reply-To: openstack-s...@lists.openstack.org

TL;DR: the self-healing SIG is officially formed!  Watch the openstack-sigs 
mailing list for future developments.


A longer version of this announcement can be found at

  https://blog.adamspiers.org/2017/11/24/announcing-openstacks-self-healing-sig/


A SIG is born!
--

After an unofficial kick-off meeting at the last PTG in Denver, I proposed the 
creation of a new self-healing SIG:


  http://lists.openstack.org/pipermail/openstack-sigs/2017-September/54.html

At the recent Summit in Sydney we held a Forum session; around 30 people
attended, which was extremely encouraging!  You can read more
details in the etherpad, but here is the quick summary ...


Most importantly, we resolved the naming and scoping issues, concluding that 
to avoid biting off too much in one go, it was better to be pragmatic and 
start small:


- Initially focus on cloud infrastructure, and not worry too much
  about the user-facing impact of failures yet; we can add that
  concern whenever it makes sense (which is particularly relevant
  for telcos / NFV).

- Not worry too much about optimization initially; Watcher is
  possibly the only project focusing on this right now, and again we
  can expand to include optimization any time we want.

Now that the naming and scoping issues are resolved, I am excited to announce 
that the Self-healing SIG is officially formed!


Discussion went beyond mere administrivia, however:

- We collected a few initial use cases.

- We informally decided the governance of the SIG. I asked if anyone
  else would like to assume leadership, but no one seemed keen,
  dashing my hopes of avoiding extra work ;-)  But Eric Kao, PTL of
  Congress, generously offered to act as co-chair.

- We discussed health check APIs, which were mentioned in at least 2
  or 3 other Forum sessions this time round.

- We agreed that we wanted an IRC channel, and that it could host
  bi-weekly meetings. However as usual there was no clean solution
  to choosing a time which would suit everyone ;-/  I'll try to
  figure out what to do about this!


Get involved


You are warmly invited to join, if this topic interests you:

- Ensure you are subscribed to the openstack-sigs mailing list:

http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-sigs

  and watch out for posts tagged [self-healing].

- Bookmark https://wiki.openstack.org/wiki/Self-healing_SIG
  which is the SIG's official home.


Next steps
--

I will set up the IRC channel, and see if we can make progress on agreeing 
times for regular IRC meetings.


Other than this administrivia, it is of course up to the community to decide 
in which direction the SIG should go, but my suggestions are:


- Continue to collect use cases.  It makes sense to have a very
  lightweight process for this (at least, initially), so Eric has
  created a Google Doc and populated it with a suggested template and
  a first example:


https://docs.google.com/document/d/13N36g2RlUYs8mw7hbfRXw6y2Jc-V2XGrXgfPXPpUvuU/edit?usp=sharing

  Feel free to add your own based on this template.

- Collect links to any existing documentation or other resources which
  describe how existing services can be combined.  This awesome talk
  on Advanced Fault Management with Vitrage and Mistral is a perfect
  example:

  
https://www.openstack.org/videos/sydney-2017/advanced-fault-management-with-vitrage-and-mistral

  and here is another:

  
https://www.openstack.org/videos/barcelona-2016/building-self-healing-applications-with-aodh-zaqar-and-mistral

  but we need to make it easier for operators to understand which
  combinations like this are possible, and easier for them to be set
  up.

- Finish the architecture diagram drafted in Denver:


https://docs.google.com/drawings/d/1kEFtVpQ4c8HipSp34EVAkcSGmwyg1MzWf_H5oGTtl-Y/edit?usp=sharing

- At a higher level, we could document reference stacks which address
  multiple self-healing cases.

- Talk more with the OPNFV community to find out what capabilities
  they have which could be reused within non-NFV OpenStack clouds.

- Perform gaps analysis on the use cases, and liase with specific
  projects to drive development in directions which can address those
  gaps.


Re: [openstack-dev] [all][deployment][kolla][tripleo][osa] Service diagnostics task force

2017-11-07 Thread Adam Spiers

Michał Jastrzębski  wrote:

Hello my dearest of communities,

During the deployment tools session at the PTG we discussed the need for deep
health checking and metering of running services. It's very relevant
in the context of containers (especially k8s) and HA - things like
watchdogs, heartbeats or exposing relative metrics (I don't want to get
into the design here; suffice to say it's a non-trivial problem to solve).

We would like to start a "task force" of a few volunteers from both the
deployment tool side (ops, HA) and project dev teams. We would like to
design together a proper health check mechanism for one of the projects, to
create best practices and a design that could later be implemented in
all other services.

We would like to ask for a volunteer project team to join us and spearhead this effort.


Sorry for the late reply - I only just found this thread.  But I would
certainly like to be involved too :-)
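
To illustrate the sort of "deep health check" being discussed (purely a
sketch, not any project's actual interface), a service could expose a small
liveness/readiness endpoint which a watchdog, Kubernetes probe or Pacemaker
monitor operation could poll:

    # Sketch of a deep health check endpoint; the individual checks are
    # hypothetical placeholders (DB connectivity, MQ reachability, worker
    # heartbeats etc. in a real service).
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def run_health_checks():
        return {
            "database": True,         # e.g. can we execute 'SELECT 1'?
            "message_queue": True,    # e.g. can we open an AMQP channel?
            "worker_heartbeat": True, # e.g. heartbeat seen in last 60s?
        }

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthcheck":
                self.send_error(404)
                return
            results = run_health_checks()
            body = json.dumps(results).encode()
            # 200 = healthy, 503 = degraded; pollers treat 503 as "sick"
            self.send_response(200 if all(results.values()) else 503)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8080), HealthHandler).serve_forever()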



[openstack-dev] [Openstack-sigs] [meta] Proposal for self-healing SIG (fwd)

2017-09-17 Thread Adam Spiers
Hi everyone,

As per below, I've just proposed the creation of a new SIG.  Feedback
is very welcome - ideally it would all be collected in the same thread
I started over on the openstack-sigs list, but feedback in two places
is more useful than nowhere, so I'll keep an eye out here too ;-)

Thanks!
Adam

- Forwarded message from Adam Spiers <aspi...@suse.com> -

Date: Sun, 17 Sep 2017 23:35:02 +0100
From: Adam Spiers <aspi...@suse.com>
To: OpenStack SIGs list <openstack-s...@lists.openstack.org>
Subject: [Openstack-sigs] [meta] Proposal for self-healing SIG

Hi all, 

[TL;DR: we want to set up a "self-healing infrastructure" SIG.] 

One of the biggest promises of the cloud vision was the idea that all 
the infrastructure could be managed in a policy-driven fashion, 
reacting to failures and other events by automatically healing and 
optimising services.  Most of the components required to implement 
such an architecture already exist, e.g. 

  - Monasca: Monitoring
  - Aodh: Alarming
  - Congress: Policy-based governance
  - Mistral: Workflow
  - Senlin: Clustering
  - Vitrage: Root Cause Analysis
  - Watcher: Optimization
  - Masakari: Compute plane HA
  - Freezer-dr: DR and compute plane HA

However, there is not yet a clear strategy within the community for 
how these should all tie together. 

So at the PTG last week in Denver, we held an initial cross-project 
meeting to discuss this topic.[0]  It was well-attended, with 
representation from almost all of the relevant projects, and it felt 
like a very productive session to me.  I shall do my best to summarise 
whilst trying to avoid any misrepresentation ...

There was general agreement that the following actions would be 
worthwhile: 

  - Document reference stacks describing what use cases can already be
addressed with the existing projects.  (Even better if some of
these stacks have already been tested in the wild.)

  - Document what integrations between the projects already exist at a
technical level.  (We actually began this during the meeting, by
placing the projects into phases of a high-level flow, and then
collaboratively building a Google Drawing to show that.[1])

  - Collect real-world use cases from operators, including ones which
they would like to accomplish but cannot yet.

  - From the above, perform gaps analysis to help shape the future
direction of these projects, e.g. through specs targeting those
gaps.

  - Perform overlap analysis to help ensure that the projects are
correctly scoped and integrate well without duplicating any
significant effort.[2]

  - Set up a SIG[3] to promote further discussion across the projects
and with operators.  I talked to Thierry afterwards, and
consequently this email is the first step on that path :-)

  - Allocate the SIG a mailing list prefix - "[self-healing]" or
similar.

  - Set up a bi-weekly IRC meeting for the SIG.

  - Continue the discussion at the Sydney Forum, since it's an ideal
opportunity to get developers and operators together and decide
what the next steps should be.

  - Continue the discussion at the next Ops meetup in Tokyo.

I got coerced^Wvolunteered to drive the next steps ;-)  So far I 
have created an etherpad proposing the Forum session[4], and added it 
to the Forum wiki page[5].  I'll also add it to the SIG wiki page[6]. 

There were things we did not reach a concrete conclusion on: 

  - What should the SIG be called?  We felt that "self-healing" was
pretty darn close to capturing the intent of the topic.  However
as a natural pedant, I couldn't help but notice that technically
speaking, that would most undesirably exclude Watcher, because the
optimization it provides isn't *quite* "healing" - the word
"healing" implies that something is sick, and optimization can be
applied even when the cloud is perfectly healthy.  Any suggestions
for a name with a marginally wider scope would be gratefully
received.

  - Should the SIG be scoped to only focus on self-healing (and
self-optimization) of OpenStack infrastructure, or should it also
include self-healing of workloads?  My feeling is that we should
keep it scoped to the infrastructure which falls under the
responsibility of the cloud operators; anything user-facing would
be very different from a process perspective.

  - How should the SIG's governance be set up?  Unfortunately it
didn't occur to me to raise this question during the discussion,
but I've since seen that the k8s SIG managed to make some
decisions in this regard[7], and stealing their idea of a PTL-type
model with a minimum of 2 chairs sounds good to me.

  - Which timezone the IRC meeting should be in?  As usual, there were
interested parties from all the usual continents, so no one time
would suit everyone.  I guess I can just submit a review

Re: [openstack-dev] [oslo][barbican][sahara] start RPC service before launcher wait?

2017-09-14 Thread Adam Spiers
Hi Ken,

Thanks a lot for the analysis, and sorry for the slow reply!
Comments inline...

Ken Giusti <kgiu...@gmail.com> wrote:
> Hi Adam,
> 
> I think there's a couple of problems here.
> 
> Regardless of worker count, the service.wait() is called before
> service.start().  And from looking at the oslo.service code, the 'wait()'
> method is call after start(), then again after stop().  This doesn't match
> up with the intended use of oslo.messaging.server.wait(), which should only
> be called after .stop().

Hmm, so are you saying that there might be a bug in oslo.service's
usage of oslo.messaging, and that this Sahara bugfix was the wrong
approach too?

https://review.openstack.org/#/c/280741/1/sahara/cli/sahara_engine.py

> Perhaps a bigger issue is that in the multi threaded case all threads
> appear to be calling start, wait, and stop on the same instance of the
> service (oslo.messaging rpc server).  At least that's what I'm seeing in my
> muchly reduced test code:
> 
> https://paste.fedoraproject.org/paste/-73zskccaQvpSVwRJD11cA
> 
> The log trace shows multiple calls to start, wait, stop via different
> threads to the same TaskServer instance:
> 
> https://paste.fedoraproject.org/paste/dyPq~lr26sQZtMzHn5w~Vg
> 
> Is that expected?

Unfortunately in the interim, your pastes seem to have vanished - any
chance you could repaste them?

Thanks,
Adam

> On Mon, Jul 31, 2017 at 9:32 PM, Adam Spiers <aspi...@suse.com> wrote:
> > Ken Giusti <kgiu...@gmail.com> wrote:
> >> On Mon, Jul 31, 2017 at 10:01 AM, Adam Spiers <aspi...@suse.com> wrote:
> >>> I recently discovered a bug where barbican-worker would hang on
> >>> shutdown if queue.asynchronous_workers was changed from 1 to 2:
> >>>
> >>>https://bugs.launchpad.net/barbican/+bug/1705543
> >>>
> >>> resulting in a warning like this:
> >>>
> >>>WARNING oslo_messaging.server [-] Possible hang: stop is waiting for
> >>> start to complete
> >>>
> >>> I found a similar bug in Sahara:
> >>>
> >>>https://bugs.launchpad.net/sahara/+bug/1546119
> >>>
> >>> where the fix was to call start() on the RPC service before making the
> >>> launcher wait() on it, so I ported the fix to Barbican, and it seems
> >>> to work fine:
> >>>
> >>>https://review.openstack.org/#/c/485755
> >>>
> >>> I noticed that both projects use ProcessLauncher; barbican uses
> >>> oslo_service.service.launch() which has:
> >>>
> >>>    if workers is None or workers == 1:
> >>>        launcher = ServiceLauncher(conf, restart_method=restart_method)
> >>>    else:
> >>>        launcher = ProcessLauncher(conf, restart_method=restart_method)
> >>>
> >>> However, I'm not an expert in oslo.service or oslo.messaging, and one
> >>> of Barbican's core reviewers (thanks Kaitlin!) noted that not many
> >>> other projects start the task before calling wait() on the launcher,
> >>> so I thought I'd check here whether that is the correct fix, or
> >>> whether there's something else odd going on.
> >>>
> >>> Any oslo gurus able to shed light on this?
> >>>
> >>
> >> As far as an oslo.messaging server is concerned, the order of operations
> >> is:
> >>
> >> server.start()
> >> # do stuff until ready to stop the server...
> >> server.stop()
> >> server.wait()
> >>
> >> The final wait blocks until all requests that are in progress when stop()
> >> is called finish and cleanup.
> >
> > Thanks - that makes sense.  So the question is, why would
> > barbican-worker only hang on shutdown when there are multiple workers?
> > Maybe the real bug is somewhere in oslo_service.service.ProcessLauncher
> > and it's not calling start() correctly?

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [oslo][barbican][sahara] start RPC service before launcher wait?

2017-07-31 Thread Adam Spiers

Ken Giusti <kgiu...@gmail.com> wrote:

On Mon, Jul 31, 2017 at 10:01 AM, Adam Spiers <aspi...@suse.com> wrote:

I recently discovered a bug where barbican-worker would hang on
shutdown if queue.asynchronous_workers was changed from 1 to 2:

   https://bugs.launchpad.net/barbican/+bug/1705543

resulting in a warning like this:

   WARNING oslo_messaging.server [-] Possible hang: stop is waiting for
start to complete

I found a similar bug in Sahara:

   https://bugs.launchpad.net/sahara/+bug/1546119

where the fix was to call start() on the RPC service before making the
launcher wait() on it, so I ported the fix to Barbican, and it seems
to work fine:

   https://review.openstack.org/#/c/485755

I noticed that both projects use ProcessLauncher; barbican uses
oslo_service.service.launch() which has:

   if workers is None or workers == 1:
       launcher = ServiceLauncher(conf, restart_method=restart_method)
   else:
       launcher = ProcessLauncher(conf, restart_method=restart_method)

However, I'm not an expert in oslo.service or oslo.messaging, and one
of Barbican's core reviewers (thanks Kaitlin!) noted that not many
other projects start the task before calling wait() on the launcher,
so I thought I'd check here whether that is the correct fix, or
whether there's something else odd going on.

Any oslo gurus able to shed light on this?


As far as an oslo.messaging server is concerned, the order of operations is:

server.start()
# do stuff until ready to stop the server...
server.stop()
server.wait()

The final wait blocks until all requests that are in progress when stop()
is called finish and cleanup.
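
Spelled out as a minimal sketch (the transport URL, topic and endpoint
below are placeholders, not anything taken from Barbican or Sahara), the
intended lifecycle is:

    # Minimal sketch of the start/stop/wait ordering described above.
    import oslo_messaging
    from oslo_config import cfg

    CONF = cfg.CONF

    class PingEndpoint(object):
        def ping(self, ctxt):
            return 'pong'

    transport = oslo_messaging.get_transport(
        CONF, url='rabbit://guest:guest@localhost:5672/')
    target = oslo_messaging.Target(topic='example_topic', server='example_server')
    server = oslo_messaging.get_rpc_server(transport, target, [PingEndpoint()])

    server.start()
    # ... serve RPC requests until it is time to shut down ...
    server.stop()   # stop accepting new requests
    server.wait()   # block until in-flight requests have finished and cleaned up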


Thanks - that makes sense.  So the question is, why would
barbican-worker only hang on shutdown when there are multiple workers?
Maybe the real bug is somewhere in oslo_service.service.ProcessLauncher
and it's not calling start() correctly?

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [oslo][barbican][sahara] start RPC service before launcher wait?

2017-07-31 Thread Adam Spiers

Hi all,

I recently discovered a bug where barbican-worker would hang on
shutdown if queue.asynchronous_workers was changed from 1 to 2:

   https://bugs.launchpad.net/barbican/+bug/1705543

resulting in a warning like this:

   WARNING oslo_messaging.server [-] Possible hang: stop is waiting for start 
to complete

I found a similar bug in Sahara:

   https://bugs.launchpad.net/sahara/+bug/1546119

where the fix was to call start() on the RPC service before making the
launcher wait() on it, so I ported the fix to Barbican, and it seems
to work fine:

   https://review.openstack.org/#/c/485755
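
(To spell out what "start() before wait()" means in practice, here is a
purely illustrative sketch; the service class and worker count are
hypothetical, this is not the actual Barbican or Sahara patch, and as the
replies above discuss, it is not yet clear whether this pattern is the
right fix or merely works around a launcher bug.)

    from oslo_config import cfg
    from oslo_service import service

    CONF = cfg.CONF

    class RpcService(service.Service):
        """Hypothetical stand-in for a service wrapping an RPC server."""

        def start(self):
            super(RpcService, self).start()
            # a real service would start its oslo.messaging RPC server here

    def main():
        server = RpcService()
        launcher = service.launch(CONF, server, workers=2)
        server.start()     # the pattern in question: start() explicitly ...
        launcher.wait()    # ... before blocking in the launcher's wait()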

I noticed that both projects use ProcessLauncher; barbican uses
oslo_service.service.launch() which has:

   if workers is None or workers == 1:
       launcher = ServiceLauncher(conf, restart_method=restart_method)
   else:
       launcher = ProcessLauncher(conf, restart_method=restart_method)

However, I'm not an expert in oslo.service or oslo.messaging, and one
of Barbican's core reviewers (thanks Kaitlin!) noted that not many
other projects start the task before calling wait() on the launcher,
so I thought I'd check here whether that is the correct fix, or
whether there's something else odd going on.

Any oslo gurus able to shed light on this?

Thanks!
Adam

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [vitrage] [nova] [HA] [masakari] VM Heartbeat / Healthcheck Monitoring

2017-05-17 Thread Adam Spiers

I don't see any reason why masakari couldn't handle that, but you'd
have to ask Sampath and the masakari team whether they would consider
that in scope for their roadmap.

Waines, Greg <greg.wai...@windriver.com> wrote:

Sure.  I can propose a new user story.

And then are you thinking of including this user story in the scope of what 
masakari would be looking at ?

Greg.


From: Adam Spiers <aspi...@suse.com>
Reply-To: "openstack-dev@lists.openstack.org" 
<openstack-dev@lists.openstack.org>
Date: Wednesday, May 17, 2017 at 10:08 AM
To: "openstack-dev@lists.openstack.org" <openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck 
Monitoring

Thanks for the clarification Greg.  This sounds like it has the
potential to be a very useful capability.  May I suggest that you
propose a new user story for it, along similar lines to this existing
one?

http://specs.openstack.org/openstack/openstack-user-stories/user-stories/proposed/ha_vm.html

Waines, Greg <greg.wai...@windriver.com> wrote:
Yes that’s correct.
VM Heartbeating / Health-check Monitoring would introduce intrusive / white-box 
type monitoring of VMs / Instances.

I realize this is somewhat in the gray-zone of what a cloud should be 
monitoring or not,
but I believe it provides an alternative for Applications deployed in VMs that 
do not have an external monitoring/management entity like a VNF Manager in the 
MANO architecture.
And even for VMs with VNF Managers, it provides a highly reliable alternate 
monitoring path that does not rely on Tenant Networking.

You’re correct, that VM HB/HC Monitoring would leverage
https://wiki.libvirt.org/page/Qemu_guest_agent
that would require the agent to be installed in the images for talking back to 
the compute host.
( there are other examples of similar approaches in openstack ... the 
murano-agent for installation, the swift-agent for object store management )
Although here, in the case of VM HB/HC Monitoring, via the QEMU Guest Agent, 
the messaging path is internal thru a QEMU virtual serial device.  i.e. a very 
simple interface with very few dependencies ... it’s up and available very 
early in VM lifecycle and virtually always up.

Wrt failure modes / use-cases

· a VM’s response to a Heartbeat Challenge Request can be as simple as
  just ACK-ing,
  this alone allows for detection of:

  o a failed or hung QEMU/KVM instance, or
  o a failed or hung VM’s OS, or
  o a failure of the VM’s OS to schedule the QEMU Guest Agent daemon, or
  o a failure of the VM to route basic IO via linux sockets.

· I have had feedback that this is similar to the virtual hardware
  watchdog of QEMU/KVM ( https://libvirt.org/formatdomain.html#elementsWatchdog )

· However, the VM Heartbeat / Health-check Monitoring

  o provides a higher-level (i.e. application-level) heartbeating,
    i.e. if the Heartbeat requests are being answered by the Application
    running within the VM,

  o provides more than just heartbeating, as the Application can use it to
    trigger a variety of audits,

  o provides a mechanism for the Application within the VM to report a Health
    Status / Info back to the Host / Cloud,

  o provides notification of the Heartbeat / Health-check status to
    higher-level cloud entities thru Vitrage,
    e.g.  VM-Heartbeat-Monitor - to - Vitrage - (EventAlarm) - Aodh - ... - VNF-Manager
                                              - (StateChange) - Nova - ... - VNF Manager


Greg.


From: Adam Spiers <aspi...@suse.com>
Reply-To: "openstack-dev@lists.openstack.org" <openstack-dev@lists.openstack.org>
Date: Tuesday, May 16, 2017 at 7:29 PM
To: "openstack-dev@lists.openstack.org" <openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck 
Monitoring

Waines, Greg <greg.wai...@windriver.com> wrote:
thanks for the pointers Sam.

I took a quick look.
I agree that the VM Heartbeat / Health-check looks like a good fit into 
Masakari.

Currently your instance monitoring looks like it is strictly black-box type 
monitoring thru libvirt events.
Is that correct ?
i.e. you do not do any intrusive type monitoring of the instance thru the QEMU 
Guest Agent facility
  correct ?

That is correct:

https://github.com/openstack/masakari-monitors/blob/master/masakarimonitors/instancemonitor/instance.py
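
(For anyone who hasn't looked at the linked code: "black-box monitoring
thru libvirt events" essentially means subscribing to domain lifecycle
events on the compute host.  A bare-bones sketch using the libvirt Python
bindings - not the actual masakari-monitors implementation - looks
something like this:)

    import libvirt

    def lifecycle_callback(conn, dom, event, detail, opaque):
        # A real monitor would turn this into a notification payload;
        # here we just print the domain name and event/detail codes.
        print('domain %s: event=%d detail=%d' % (dom.name(), event, detail))

    def main():
        libvirt.virEventRegisterDefaultImpl()
        conn = libvirt.openReadOnly('qemu:///system')
        conn.domainEventRegisterAny(None,
                                    libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE,
                                    lifecycle_callback, None)
        while True:
            libvirt.virEventRunDefaultImpl()   # dispatch pending events

    if __name__ == '__main__':
        main()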

I think this is what VM Heartbeat / Health-check would add to Masakari.
Let me kn

Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring

2017-05-17 Thread Adam Spiers

Thanks for the clarification Greg.  This sounds like it has the
potential to be a very useful capability.  May I suggest that you
propose a new user story for it, along similar lines to this existing
one?

http://specs.openstack.org/openstack/openstack-user-stories/user-stories/proposed/ha_vm.html

Waines, Greg <greg.wai...@windriver.com> wrote:

Yes that’s correct.
VM Heartbeating / Health-check Monitoring would introduce intrusive / white-box 
type monitoring of VMs / Instances.

I realize this is somewhat in the gray-zone of what a cloud should be 
monitoring or not,
but I believe it provides an alternative for Applications deployed in VMs that 
do not have an external monitoring/management entity like a VNF Manager in the 
MANO architecture.
And even for VMs with VNF Managers, it provides a highly reliable alternate 
monitoring path that does not rely on Tenant Networking.

You’re correct, that VM HB/HC Monitoring would leverage
https://wiki.libvirt.org/page/Qemu_guest_agent
that would require the agent to be installed in the images for talking back to 
the compute host.
( there are other examples of similar approaches in openstack ... the 
murano-agent for installation, the swift-agent for object store management )
Although here, in the case of VM HB/HC Monitoring, via the QEMU Guest Agent, 
the messaging path is internal thru a QEMU virtual serial device.  i.e. a very 
simple interface with very few dependencies ... it’s up and available very 
early in VM lifecycle and virtually always up.

Wrt failure modes / use-cases

· a VM’s response to a Heartbeat Challenge Request can be as simple as
  just ACK-ing,
  this alone allows for detection of:

  o a failed or hung QEMU/KVM instance, or
  o a failed or hung VM’s OS, or
  o a failure of the VM’s OS to schedule the QEMU Guest Agent daemon, or
  o a failure of the VM to route basic IO via linux sockets.

· I have had feedback that this is similar to the virtual hardware
  watchdog of QEMU/KVM ( https://libvirt.org/formatdomain.html#elementsWatchdog )

· However, the VM Heartbeat / Health-check Monitoring

  o provides a higher-level (i.e. application-level) heartbeating,
    i.e. if the Heartbeat requests are being answered by the Application
    running within the VM,

  o provides more than just heartbeating, as the Application can use it to
    trigger a variety of audits,

  o provides a mechanism for the Application within the VM to report a Health
    Status / Info back to the Host / Cloud,

  o provides notification of the Heartbeat / Health-check status to
    higher-level cloud entities thru Vitrage,
    e.g.  VM-Heartbeat-Monitor - to - Vitrage - (EventAlarm) - Aodh - ... - VNF-Manager
                                              - (StateChange) - Nova - ... - VNF Manager


Greg.


From: Adam Spiers <aspi...@suse.com>
Reply-To: "openstack-dev@lists.openstack.org" 
<openstack-dev@lists.openstack.org>
Date: Tuesday, May 16, 2017 at 7:29 PM
To: "openstack-dev@lists.openstack.org" <openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck 
Monitoring

Waines, Greg <greg.wai...@windriver.com> wrote:
thanks for the pointers Sam.

I took a quick look.
I agree that the VM Heartbeat / Health-check looks like a good fit into 
Masakari.

Currently your instance monitoring looks like it is strictly black-box type 
monitoring thru libvirt events.
Is that correct ?
i.e. you do not do any intrusive type monitoring of the instance thru the QEMU 
Guest Agent facility
  correct ?

That is correct:

https://github.com/openstack/masakari-monitors/blob/master/masakarimonitors/instancemonitor/instance.py

I think this is what VM Heartbeat / Health-check would add to Masakari.
Let me know if you agree.

OK, so you are looking for something slightly different I guess, based
on this QEMU guest agent?

   https://wiki.libvirt.org/page/Qemu_guest_agent

That would require the agent to be installed in the images, which is
extra work but I imagine quite easily justifiable in some scenarios.
What failure modes do you have in mind for covering with this
approach - things like the guest kernel freezing, for instance?

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev




__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



___

Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring

2017-05-17 Thread Adam Spiers

Yep :-)  That's pretty much exactly what I was suggesting elsewhere in
this thread:

http://lists.openstack.org/pipermail/openstack-dev/2017-May/116748.html

Waines, Greg <greg.wai...@windriver.com> wrote:

Excellent.
Yeah I just watched your Boston Summit presentation and noticed, at least when 
you were talking about host-monitoring, you were looking at having alternative 
backends for reporting e.g. to masakari-api or to mistral or ... to Vitrage :)

Greg.

From: Adam Spiers <aspi...@suse.com>
Reply-To: "openstack-dev@lists.openstack.org" 
<openstack-dev@lists.openstack.org>
Date: Tuesday, May 16, 2017 at 7:42 PM
To: "openstack-dev@lists.openstack.org" <openstack-dev@lists.openstack.org>
Subject: Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck 
Monitoring

Waines, Greg <greg.wai...@windriver.com> wrote:
Sam,

Two other, higher-level points I wanted to discuss with you about Masakari.


First,
I notice that you are doing monitoring, auto-recovery and even host maintenance
type functionality as part of the Masakari architecture.

Are you open to some configurability (enabling/disabling) of these capabilities?

I can't speak for Sampath or the Masakari developers, but the monitors
are standalone components.  Currently they can only send notifications
in a format which the masakari-api service can understand, but I guess
it wouldn't be hard to extend them to send notifications in other
formats if that made sense.

e.g. OPNFV guys would NOT want auto-recovery, they would prefer that fault
events get reported to Vitrage ... and eventually filter up to Aodh Alarms
that get received by VNF Managers which would be responsible for the
recovery.

e.g. some deployers of openstack might want to disable parts or all of your
monitoring, if using other mechanisms such as Zabbix or Nagios for the host
monitoring (say)

Yes, exactly!  This kind of configurability and flexibility which
would allow each cloud architect to choose which monitoring / alerting
/ recovery components suit their requirements best in a "mix'n'match"
fashion, is exactly what we are aiming for with our modular approach
to the design of compute plane HA.  If the various monitoring
components adopt a driver-based approach to alerting and/or the
ability to alert via a lowest common denominator format such as simple
HTTP POST of JSON blobs, then it should be possible for each cloud
deployer to integrate the monitors with whichever reporting dashboards
/ recovery workflow controllers best satisfy their requirements.

Second, are you open to configurably having fault events reported to
Vitrage ?

Again I can't speak on behalf of the Masakari project, but this sounds
like a great idea to me :)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev




__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring

2017-05-16 Thread Adam Spiers

Waines, Greg  wrote:

Sam,

Two other, higher-level points I wanted to discuss with you about Masakari.


First,
I notice that you are doing monitoring, auto-recovery and even host maintenance
type functionality as part of the Masakari architecture.

Are you open to some configurability (enabling/disabling) of these capabilities?


I can't speak for Sampath or the Masakari developers, but the monitors
are standalone components.  Currently they can only send notifications
in a format which the masakari-api service can understand, but I guess
it wouldn't be hard to extend them to send notifications in other
formats if that made sense.


e.g. OPNFV guys would NOT want auto-recovery, they would prefer that fault
events get reported to Vitrage ... and eventually filter up to Aodh Alarms
that get received by VNF Managers which would be responsible for the
recovery.

e.g. some deployers of openstack might want to disable parts or all of your
monitoring, if using other mechanisms such as Zabbix or Nagios for the host
monitoring (say)


Yes, exactly!  This kind of configurability and flexibility which
would allow each cloud architect to choose which monitoring / alerting
/ recovery components suit their requirements best in a "mix'n'match"
fashion, is exactly what we are aiming for with our modular approach
to the design of compute plane HA.  If the various monitoring
components adopt a driver-based approach to alerting and/or the
ability to alert via a lowest common denominator format such as simple
HTTP POST of JSON blobs, then it should be possible for each cloud
deployer to integrate the monitors with whichever reporting dashboards
/ recovery workflow controllers best satisfy their requirements.
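
(As a concrete illustration of that "lowest common denominator" idea, a
monitor could emit something as simple as the following.  The endpoint URL
and payload fields are invented for the example; they are not the
masakari-api or Vitrage wire format.)

    import datetime
    import requests

    def send_alert(receiver_url, hostname, event_type, details):
        payload = {
            'hostname': hostname,
            'type': event_type,       # e.g. 'COMPUTE_HOST' or 'VM'
            'generated_time': datetime.datetime.utcnow().isoformat() + 'Z',
            'payload': details,
        }
        resp = requests.post(receiver_url, json=payload, timeout=10)
        resp.raise_for_status()

    # e.g.:
    # send_alert('http://alert-receiver.example.com/v1/notifications',
    #            'compute-01', 'COMPUTE_HOST', {'event': 'STOPPED'})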


Second, are you open to configurably having fault events reported to
Vitrage ?


Again I can't speak on behalf of the Masakari project, but this sounds
like a great idea to me :)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring

2017-05-16 Thread Adam Spiers

Waines, Greg  wrote:

thanks for the pointers Sam.

I took a quick look.
I agree that the VM Heartbeat / Health-check looks like a good fit into 
Masakari.

Currently your instance monitoring looks like it is strictly black-box type 
monitoring thru libvirt events.
Is that correct ?
i.e. you do not do any intrusive type monitoring of the instance thru the QEMU 
Guest Agent facility
  correct ?


That is correct:

https://github.com/openstack/masakari-monitors/blob/master/masakarimonitors/instancemonitor/instance.py


I think this is what VM Heartbeat / Health-check would add to Masakari.
Let me know if you agree.


OK, so you are looking for something slightly different I guess, based
on this QEMU guest agent?

   https://wiki.libvirt.org/page/Qemu_guest_agent

That would require the agent to be installed in the images, which is
extra work but I imagine quite easily justifiable in some scenarios.
What failure modes do you have in mind for covering with this
approach - things like the guest kernel freezing, for instance?
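
(To make the intrusive alternative concrete, the guest agent can be poked
from the host via the libvirt bindings along the following lines.  This is
only a rough sketch: the domain name is made up, and it assumes
libvirt-python plus a guest image with qemu-guest-agent installed and its
virtio-serial channel configured.)

    import json
    import libvirt
    import libvirt_qemu

    def guest_ping(domain_name, timeout=5):
        conn = libvirt.open('qemu:///system')
        try:
            dom = conn.lookupByName(domain_name)
            # guest-ping returns {"return": {}} if the agent answered, which
            # implies the guest OS is healthy enough to schedule the agent
            # and route IO over the virtio-serial channel.
            reply = libvirt_qemu.qemuAgentCommand(
                dom, json.dumps({'execute': 'guest-ping'}), timeout, 0)
            return json.loads(reply)
        finally:
            conn.close()

    if __name__ == '__main__':
        print(guest_ping('instance-00000001'))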

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [vitrage] [nova] [HA] VM Heartbeat / Healthcheck Monitoring

2017-05-16 Thread Adam Spiers

Afek, Ifat (Nokia - IL/Kfar Sava)  wrote:

On 16/05/2017, 4:36, "Sam P"  wrote:

   Hi Greg,

 In Masakari [0] for VMHA, we have already implemented a somewhat
    similar function in masakari-monitors.
 Masakari-monitors runs on the nova-compute node, and monitors host,
    process or instance failures.
 The Masakari instance monitor has similar functionality to what you
    have described.
Please see [1] for more details on instance monitoring.
[0] https://wiki.openstack.org/wiki/Masakari
[1] 
https://github.com/openstack/masakari-monitors/tree/master/masakarimonitors/instancemonitor

Once masakari-monitors detects failures, it will send notifications to
   masakari-api to take appropriate recovery actions to recover that VM
   from the failures.


You can also find out more about our architectural plans by watching
this talk which Sampath and I gave in Boston:

  
https://www.openstack.org/videos/boston-2017/high-availability-for-instances-moving-to-a-converged-upstream-solution

The slides are here:

  https://aspiers.github.io/openstack-summit-2017-boston-compute-ha/

We didn't go into much depth on monitoring and recovery of individual
VMs, but as Sampath explained, Masakari already handles both of these.


Hi Greg, Sam,

As Vitrage is about correlating alarms that come from different
sources, and is not a monitor by itself – I think that it can benefit
from information retrieved by both Masakari and Zabbix monitors.

Zabbix is already integrated into Vitrage. I don’t know if there are
specific tests for VM heartbeat, but I think it is very likely that
there are.  Regarding Masakari – looking at your documents, I believe
that integrating your monitoring information into Vitrage could be
quite straightforward.


Yes, this makes sense.  Masakari already cleanly decouples
monitoring/alerting from automated recovery, so it could support this
quite nicely.  And the modular converged architecture we explained in
the presentation will maintain that clean separation of
responsibilities whilst integrating Masakari together with other
components such as Pacemaker, Mistral, and maybe Vitrage too.

For example whilst so far this thread has been about VM instance
monitoring, another area where Vitrage could integrate with Masakari
is compute host monitoring.

If you watch this part of our presentation where we explained the next
generation architecture, you'll see that we propose a new
"nova-host-alerter" component which has a driver-based mechanism for
alerting different services when a compute host experiences a failure:

   https://youtu.be/YPKE1guti8E?t=32m43s

So one obvious possibility would be to add a driver for Vitrage, so
that Vitrage can be alerted when Pacemaker spots a host failure.

Similarly, we could extend Pacemaker configurations to alert Vitrage
when individual processes such as nova-compute or libvirtd fail.
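
(To illustrate what "driver-based" means here, a sketch of the shape such
an alerter might take is below.  Every name in it is hypothetical - the
nova-host-alerter mentioned in the talk was a design proposal at this
point, not existing code.)

    import abc

    class HostAlertDriver(abc.ABC):
        """One driver per downstream consumer (masakari-api, Mistral, Vitrage, ...)."""

        @abc.abstractmethod
        def alert_host_down(self, hostname, timestamp):
            """Tell the backend that the cluster has declared this host failed."""

    class LogDriver(HostAlertDriver):
        def alert_host_down(self, hostname, timestamp):
            print('host %s declared down at %s' % (hostname, timestamp))

    class VitrageDriver(HostAlertDriver):
        def __init__(self, client):
            self.client = client   # hypothetical Vitrage events client

        def alert_host_down(self, hostname, timestamp):
            # Translate into whatever event format the backend expects.
            self.client.post_event(hostname=hostname, time=timestamp,
                                   event_type='compute.host.down')

    def notify_all(drivers, hostname, timestamp):
        # Fan a single host-down event out to every configured driver.
        for driver in drivers:
            driver.alert_host_down(hostname, timestamp)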

If you would like to discuss any of this further or have any more
questions, in addition to this mailing list we are also available to
talk on the #openstack-ha IRC channel!

Cheers,
Adam

P.S. I've added the [HA] badge to this thread since this discussion is
definitely related to high availability.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [HA] follow-up from HA discussion at Boston Forum

2017-05-15 Thread Adam Spiers

Hi all,

Sam P  wrote:

This is a quick reminder for HA Forum session at Boston Summit.
Thank you all for your comments and effort to make this happen in Boston Summit.

Time: Thu 11, 11:00am-11:40am
Location: Hynes Convention Center - Level One - MR 103
Etherpad: https://etherpad.openstack.org/p/BOS-forum-HA-in-openstack

Please join and let's discuss the HA issues in OpenStack...

--- Regards,
Sampath


Thanks to everyone who came to the High Availability Forum session in
Boston last week!  To me, the great turn-out proved that there is
enough general interest in HA within OpenStack to justify allocating
space for dicussion on those topics not only at each summit, but in
between the summits too.

To that end, I'd like to a) remind everyone of the weekly HA IRC
meetings:

   https://wiki.openstack.org/wiki/Meetings/HATeamMeeting

and also b) highlight an issue that we most likely need to solve:
currently these weekly IRC meetings are held at 0900 UTC on Wednesday:

   http://eavesdrop.openstack.org/#High_Availability_Meeting

which is pretty much useless for anyone in the Americas.  This time
was previously chosen because the most regular attendees were based in
Europe or Asia, but I'm now looking for suggestions on how to make
this fairer for all continents.  Some options:

- Split the 60 minutes in half, and hold two 30 minute meetings
 each week at different times, so that every timezone has convenient
 access to at least one of them.

- Alternate the timezone every other week.  This might make it hard to
 build any kind of momentum.

- Hold two meetings each week.  I'm not sure we'd have enough traffic
 to justify this, but we could try.

Any opinions, or better ideas?  Thanks!

Adam

P.S. Big thanks to Sampath for organising the Boston Forum session
and managing to attract such a healthy audience :-)

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Openstack] OpenStack-Ansible HA solution

2017-04-27 Thread Adam Spiers

Wei Hui  wrote:

Liyuenan (Maxwell Li)  wrote:

Hi, all

I have some questions about OSA project.


[snipped]


2.   Could OSA support compute node high availability? If my compute node
goes down, could the instances on this node move to other nodes?


2. As far as I know, OSA doesn't support compute node HA.
Nova has a feature called evacuate, but some mechanism is needed to
detect whether nova-compute is down and trigger the evacuation.


If you want to find out more about compute node HA you might be
interested in our upcoming presentation in Boston:

https://www.openstack.org/summit/boston-2017/summit-schedule/events/17971/high-availability-for-instances-moving-to-a-converged-upstream-solution

Cheers,
Adam

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Neutron] Alternative approaches for L3 HA

2017-02-23 Thread Adam Spiers
Anil Venkata <anilvenk...@redhat.com> wrote:
> On Thu, Feb 23, 2017 at 12:10 AM, Miguel Angel Ajo Pelayo 
> <majop...@redhat.com> wrote:
> > On Wed, Feb 22, 2017 at 11:28 AM, Adam Spiers <aspi...@suse.com> wrote:
> >> With help from others, I have started an analysis of some of the
> >> different approaches to L3 HA:
> >>
> >> https://ethercalc.openstack.org/Pike-Neutron-L3-HA
> >>
> >> (although I take responsibility for all mistakes ;-)
> 
> Did you test with this patch https://review.openstack.org/#/c/255237/  ? It
> was merged in the Newton cycle.
> With this patch, HA+L2pop doesn't depend on the control plane during failover,
> hence failover should be faster (same as without l2pop).

Thanks Anil!  I've updated the spreadsheet to take this into account.

> >> It would be great if someone from RH or RDO could provide information
> >> on how this RDO (and/or RH OSP?) solution based on Pacemaker +
> >> keepalived works - if so, I volunteer to:
> >>
> >>   - help populate column E of the above sheet so that we can
> >> understand if there are still remaining gaps in the solution, and
> >>
> >>   - document it (e.g. in the HA guide).  Even if this only ended up
> >> being considered as a shorter-term solution, I think it's still
> >> worth documenting so that it's another option available to
> >> everyone.
> >>
> >> Thanks!
> 
> > I have updated the spreadsheet.

Thanks a lot Miguel and everyone else who contributed to the
spreadsheet so far!

After a very productive meeting this morning at the PTG, I think it is
quite close to completion now, and I am already working with the docs
team on moving it into official documentation, either in the HA Guide
(which I am trying to help maintain) or the Networking Guide.  I don't
have strong opinions on where it should live - if anyone does then
please let us know now.

I also attempted to write up a mini-report summarising this morning's
meeting for future reference; it's (currently) at line 279 onwards of:

https://etherpad.openstack.org/p/neutron-ptg-pike-final

but I'll reproduce it here for convenience.

The conclusion, at least as I understand it, is as follows:

- The l3_ha solution is already working pretty well in many
  deployments, especially when coupled with a few extra benefits from
  Pacemaker (although
  https://bugs.launchpad.net/neutron/+bugs?field.tag=l3-ha might
  suggest otherwise ...)

- Some more refinements to this solution could be made to reduce the
  remaining corner cases where failures are not handled well.

- I (and hopefully others) will work towards documenting this solution
  in more detail.

- In the mean time, Ann Taraday and anyone else interested may
  continue out-of-tree experiments with different architectures such
  as tooz/etcd.  It is expected that these would be invasive changes,
  possibly taking at least 1-2 release cycles to stabilise, but they
  might still be worth it.

- If a PoC is submitted for review and looks promising, we can decide
  whether it makes sense to aim to replace the existing keepalived
  solution, or instead offer it as an alternative by introducing
  pluggable L3 drivers.  However, adding a driver abstraction layer
  would also be costly and expand the test matrix, at a time where
  developer resources are scarce. So there would need to be a
  compelling reason to do this.

I hope that's a reasonably accurate representation of the outcome from
this morning - obviously feel free to submit comments if I missed or
mistook anything.  Thanks for a great meeting!

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Neutron] Alternative approaches for L3 HA

2017-02-22 Thread Adam Spiers
Kosnik, Lubosz  wrote:
> Regarding the success of RDO, we need to remember that this deployment utilizes 
> Pacemaker, and when I was working on this feature (and even when I spoke with Assaf) 
> this external application was doing everything to make this solution work. 
> Pacemaker was responsible for checking external and internal connectivity, 
> to detect split brain and elect a master; even though keepalived was running, 
> Pacemaker was automatically killing all services and moving the FIP. 
> Assaf - is there any change in this implementation in RDO? Or are you still 
> doing everything outside of Neutron? 
> 
> Because if RDO's success is built on Pacemaker, it means that yes, Neutron 
> needs some solution which will be available for more than RH deployments. 

Agreed.

With help from others, I have started an analysis of some of the 
different approaches to L3 HA: 

https://ethercalc.openstack.org/Pike-Neutron-L3-HA 

(although I take responsibility for all mistakes ;-) 

It would be great if someone from RH or RDO could provide information 
on how this RDO (and/or RH OSP?) solution based on Pacemaker + 
keepalived works - if so, I volunteer to: 

  - help populate column E of the above sheet so that we can
understand if there are still remaining gaps in the solution, and

  - document it (e.g. in the HA guide).  Even if this only ended up
being considered as a shorter-term solution, I think it's still
worth documenting so that it's another option available to
everyone.

Thanks!

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Openstack-operators] Destructive / HA / fail-over scenarios

2016-11-28 Thread Adam Spiers
Timur Nurlygayanov  wrote:
> Hi OpenStack developers and operators,
> 
> we are going to create the test suite for destructive testing of
> OpenStack clouds. We want to hear your feedback and ideas
> about possible destructive and failover scenarios which we need
> to check.
> 
> Which scenarios do we need to check if we want to make sure that
> an OpenStack cluster is configured in High Availability mode
> and can be published as a "production/enterprise" cluster?
> 
> Your ideas are welcome, let's discuss the ideas of test scenarios in
> this email thread.

I applaud the effort to boost automated testing of failure scenarios!
And thanks a lot for polling the list before starting any work on
this.

Regarding the implementation, did you consider reusing Cloud 99, and
if not, please could you? :-) Obviously it would be good to avoid
reinventing wheels where possible.


https://www.openstack.org/summit/vancouver-2015/summit-videos/presentation/high-availability-and-resiliency-testing-strategies-for-openstack-clouds

https://github.com/cisco-oss-eng/Cloud99

If there are some gaps between Cloud99 and what is needed then it
would be worth evaluating them in order to determine whether it makes
sense to start from scratch versus simply develop Cloud99 further.

Also it would be great if you could join the #openstack-ha IRC channel
where you will find friendly folks from the broader OpenStack HA
sub-community who I'm sure will be happy to discuss this further.

You are also very welcome to join our weekly IRC meetings:

https://wiki.openstack.org/wiki/Meetings/HATeamMeeting

Thanks!
Adam

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Openstack] The possible ways of high availability for non-cloud-ready apps running on openstack

2016-08-08 Thread Adam Spiers
Hi Hossein,

hossein zabolzadeh  wrote:
> Hi there.
> I am dealing with a large number of legacy applications (MediaWiki, Joomla,
> ...) running on openstack. I am looking for the best way to improve the high
> availability of my instances. None of the applications are designed to
> tolerate failure (Non-Cloud-Ready Apps). So, what is the best way of improving
> HA on my non-clustered instances (Stateful Instances)?
> Thanks in advance.

Sorry for the slow reply - I only just noticed this.  Please see this
talk I gave in Austin with Dawid Deja:

  
https://www.openstack.org/videos/video/high-availability-for-pets-and-hypervisors-state-of-the-nation

I believe it should answer your question in a lot of detail, but
please let me know if you have follow-up questions.

Regards,
Adam

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [HA] RFC: High Availability track at future Design Summits

2016-08-02 Thread Adam Spiers
Hi Thierry,

Thierry Carrez <thie...@openstack.org> wrote:
> Adam Spiers wrote:
> > I doubt anyone would dispute that High Availability is a really
> > important topic within OpenStack, yet none of the OpenStack
> > conferences or Design Summits so far have provided an "official" track
> > or similar dedicated space for discussion on HA topics.
> > [...]
> 
> We do not provide a specific track at the "Design Summit" for HA (or for
> hot upgrades for the matter) but we have space for "cross-project
> workshops" in which HA topics would be discussed. I suspect what you
> mean here is that the one of two sessions that the current setup allows
> are far from enough to tackle that topic efficiently ?

Yes, I think that's probably true.  I get the impression cross-project
workshops are intended more for coordination of common topics between
many official big tent projects, whereas our topics typically involve
a small handful of projects, some of which are currently unofficial.

> IMHO there is dedicated space -- just not enough of it. It's one of the
> issues with the current Design Summit setup -- just not enough time and
> space to tackle everything in one week. With the new event format I
> expect we'll be able to free up more time to discuss such horizontal
> issues

Right.  I'm looking forward to the new format :-)

> but as far as Barcelona goes (where we have less space and less
> time than in Austin), I'd encourage you to still propose cross-project
> workshops (and engage on the Ops side of the Design Summit to get
> feedback from there as well).

OK thanks - I'll try to figure out the best way of following up on
these two points.  I see that

  https://wiki.openstack.org/wiki/Design_Summit/Ocata/Etherpads

is still empty, so I guess we're still on the early side of planning
for design summit tracks, which hopefully means there's still time
to propose a fishbowl session for Ops feedback on HA.

Thanks a lot for the advice!
Adam

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [HA] RFC: High Availability track at future Design Summits

2016-08-01 Thread Adam Spiers
Hi all,

I doubt anyone would dispute that High Availability is a really
important topic within OpenStack, yet none of the OpenStack
conferences or Design Summits so far have provided an "official" track
or similar dedicated space for discussion on HA topics.

This is becoming increasingly problematic as the number of HA topics
increase.  For example, in Austin a group of us spent something like
15 hours together over 3-4 days for design sessions around the future
of HA for the compute plane.

This is not by any means the only HA topic which needs discussing.
Other possible topics:

  - Input from operators on their experiences of deployment,
maintenance, and effectiveness of highly available OpenStack
infrastructure

  - Adding or improving HA support in existing projects, e.g.

  - cinder-volume active/active work is currently ongoing

  - neutron always has ongoing HA topics - the hot one in
Austin seemed to be HA+DVR+SNAT.

  - We had some great discussions with the Congress team in
Austin, which may need follow-up.

  - mistral is involved in ongoing HA work.

  - The various projects playing on the HA scene (Senlin is
another example) need the opportunity to sync up with each
other to become aware of any opportunities for integration or
potential overlap.

  - Documentation (the official HA guide)

  - Different / new approaches to HA of the control plane
(e.g. Pacemaker vs. systemd vs. other clustering technologies)

  - Testing and hardening of existing HA architectures (e.g. via
projects such as cloud99)

Whilst we do have the #openstack-ha IRC channel, weekly IRC meetings,
and of course the mailing lists, I think it would be helpful to have
an official space in the design summits for continuation of those
technical discussions face-to-face.

Granted, some of the above topics could be discussed in the related
project track (cinder, neutron, congress, documentation etc.).  But
this does not provide a forum for detailed technical discussion on
cross-project initiatives such as compute HA, or architectural debates
which don't relate to a single project, or work on HA projects which
don't have their own dedicated track in the Design Summit.

Therefore I would like to propose that future Design Summits adopt an
official HA "mini-track" (I guess one day might be sufficient), and
I'd really appreciate hearing opinions on this proposal.

Also, if the idea meets enough favour, it would be useful to find out
whether it's already too late to arrange this for Barcelona :-)

Thanks a lot!
Adam

P.S. Maybe a similar proposal on a smaller scale would be valid for
some of the operator and regional meetups too?

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [HA] weekly High Availability meetings on IRC: change of time

2016-08-01 Thread Adam Spiers
Hi everyone,

I have proposed moving the weekly High Availability IRC meetings one
hour later, back to the original time of 0900 UTC every Monday.

  https://review.openstack.org/#/c/349601/

Everyone is welcome to attend these meetings, so if you think you are
likely to regularly attend, feel free to vote on that review.

Thanks!
Adam

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [HA] RFC: user story including hypervisor reservation / host maintenance / storage AZs / event history

2016-06-07 Thread Adam Spiers
[Cc'ing product-wg@ - when replying, first please consider whether
cross-posting is appropriate]

Hi all,

Currently the OpenStack HA community is putting a lot of effort into
converging on a single upstream solution for high availability of VMs
and hypervisors[0], and we had a lot of very productive discussions in
Austin on this topic[1].

One of the first areas of focus is the high level user story:

   
http://specs.openstack.org/openstack/openstack-user-stories/user-stories/proposed/ha_vm.html

In particular, there is an open review on which we could use some
advice from the wider community.  The review proposes adding four
extra usage scenarios to the existing user story.  All of these
scenarios are to some degree related to HA of VMs and hypervisors,
however none of them exclusively - they all have scope extending to
other areas beyond HA.  Here's a very brief summary of all four, as
they relate to HA:

1. "Sticky" shared storage zones

   Scenario: all compute hosts have access to exactly one shared
   storage "availability zone" (potentially independent of the normal
   availability zones).  For example, there could be multiple NFS
   servers, and every compute host has /var/lib/nova/instances mounted
   to one of them.  On first boot, each VM is *implicitly* assigned to
   a zone, depending on which compute host nova-scheduler picks for it
   (so this could be more or less random).  Subsequent operations such
   as "nova evacuate" would need to ensure the VM only ever moves to
   other hosts in the same zone.

2. Hypervisor reservation

   The operator wants a mechanism for reserving some compute hosts
   exclusively for use as failover hosts on which to automatically
   resurrect VMs from other failed compute nodes.

3. Host maintenance

   The operator wants a mechanism for flagging hosts as undergoing
   maintenance, so that the HA mechanisms for automatic recovery are
   temporarily disabled during the maintenance window.

4. Event history

   The operator wants a way to retrieve the history of what, when,
   where and how the HA automatic recovery mechanism is performed.

And here's the review in question:

   https://review.openstack.org/#/c/318431/

My first instinct was that all of these scenarios are sufficiently
independent, complex, and extend far enough outside HA scope, that
they deserve to live in four separate user stories, rather than adding
them to our existing "HA for VMs" user story.  This could also
maximise the chances of converging on a single upstream solution for
each which works both inside and outside HA contexts.  (Please read
the review's comments for much more detail on these arguments.)

However, others made the very valid point that since there are
elements of all these stories which are indisputably related to HA for
VMs, we still need the existing user story for HA VMs to cover them,
so that it can provide "the big picture" which will tie together all
the different strands of work it requires.

So we are currently proposing to take the following steps:

 - Propose four new user stories for each of the above scenarios.

 - Link to the new stories from the "Related User Stories" section of
   the existing HA VMs story.

 - Extend the existing story so that it covers the HA-specific aspects of
   the four cases, leaving any non-HA aspects to be covered by the newly
   linked stories.

Then each story would go through the standard workflow defined by the PWG:

   https://wiki.openstack.org/wiki/ProductTeam/User_Stories

Does this sound reasonable, or is there a better way?

BTW, whilst this email is primarily asking for advice on the process,
feedback on each story is also welcome, whether it's "good idea", "you
can already do that", or "terrible idea!" ;-)  However please first
read the comments on the above review, as the obvious points have
probably already been covered :-)

Thanks a lot!

Adam

[0] A complete description of the problem area and existing solutions
was given in this talk:

  
https://www.openstack.org/videos/video/high-availability-for-pets-and-hypervisors-state-of-the-nation

[1] https://etherpad.openstack.org/p/newton-instance-ha

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [HA] About next HA Team meeting (VoIP)

2016-05-09 Thread Adam Spiers
Sam P  wrote:
> Hi All,
> 
>  In today's (9th May 2016) meeting we agreed to skip the next IRC
> meeting (which is 16th May 2016) in favour of a GoToMeeting VoIP call on
> 18th May 2016 (Wednesday).
>  Today's meeting logs and summary can be found here:
>  http://eavesdrop.openstack.org/meetings/ha/2016/ha.2016-05-09-08.04.html
> 
>  About the meeting time:
>  Everyone was fine with 8:00am UTC.
>  However, due to some resource allocation issues, I would like to shift
> this VoIP meeting to 9am UTC on 18th May 2016.
> 
>  Please let me know whether or not this time slot is convenient for you.

That later time is fine for me :)  Thanks!

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [HA] weekly High Availability meetings on IRC start next Monday

2015-11-12 Thread Adam Spiers
Sergii Golovatiuk  wrote:
> > > [2] declares meetings at 9am UTC which might be tough for US based
> > > folks. I might be wrong here as I don't know the location of HA experts.
> > >
> > >  [2] http://eavesdrop.openstack.org/#High_Availability_Meeting
> >
> > Yes, I was aware of this :-/  The problem is that the agenda for the
> > first meeting will focus on hypervisor HA, and the interested parties
> > who met in Tokyo are all based in either Europe or Asia (Japan and
> > Australia).  It's hard but possible to find a time which accommodates
> > two continents, but almost impossible to find a time which
> > accommodates three :-/
> >
> >
> I ran into issues setting the event in UTC. Use Ghana/Accra in Google Calendar,
> as it doesn't have a UTC time zone.

Speaking on behalf of my home town, United Kingdom/London would also
work ;-)

But even easier, just add this URL to your Google Calendar:

  http://eavesdrop.openstack.org/calendars/high-availability-meeting.ics

or if you want to really spam your calendar, you can add all OpenStack
meetings in one go :-)

  http://eavesdrop.openstack.org/irc-meetings.ical

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [HA] weekly High Availability meetings on IRC start next Monday

2015-11-12 Thread Adam Spiers
Hi Sergii,

Thanks a lot for the feedback!

Sergii Golovatiuk  wrote:
> Hi Adam,
> 
> It's great that we are moving forward with the HA community. Thank you so
> much for bringing HA to the next level. However, I have a couple of
> comments.
> 
> [1] contains the agenda. I guess we should move it to
> https://etherpad.openstack.org. That will allow people to add their own
> topics to discuss. Action items can be put there as well.
>
>  [1] https://wiki.openstack.org/wiki/Meetings/HATeamMeeting

It's a wiki, so anyone can already add their own topics, and in fact
the page already encourages people to do that :-)

I'd prefer to keep it as a wiki because that is consistent with the
approach of all the other OpenStack meetings, as recommended by

  https://wiki.openstack.org/wiki/Meetings/CreateaMeeting#Add_a_Meeting

It also results in a better audit trail than etherpad (where changes
can be made anonymously).

Action items will be captured by the MeetBot:

  https://git.openstack.org/cgit/openstack-infra/meetbot/tree/doc/Manual.txt

> [2] declares meetings at 9am UTC which might be tough for US based folks. I
> might be wrong here as I don't know the location of HA experts.
> 
>  [2] http://eavesdrop.openstack.org/#High_Availability_Meeting

Yes, I was aware of this :-/  The problem is that the agenda for the
first meeting will focus on hypervisor HA, and the interested parties
who met in Tokyo are all based in either Europe or Asia (Japan and
Australia).  It's hard but possible to find a time which accommodates
two continents, but almost impossible to find a time which
accommodates three :-/

If it's any consolation, the meeting logs will be available
afterwards, and also the meeting is expected to be short (around 30
minutes) since the majority of work will continue via email and IRC
outside the meeting.  This first meeting is mainly to set a direction
for future collaboration.

However, suggestions for how to handle this better are always welcome,
and if the geographical distribution of attendees of future meetings
changes, then of course we can consider changing the time to
accommodate them.  I want this to be an inclusive sub-community.

Cheers,
Adam

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [HA] ANNOUNCE: new "[HA]" topic category in mailman configuration

2015-11-12 Thread Adam Spiers
Hi all,

As you may know, Mailman allows server-side filtering of mailing list
traffic by topic categories:

  http://lists.openstack.org/cgi-bin/mailman/options/openstack-dev

(N.B. needs authentication)

Thierry has kindly added "[HA]" as a new topic category in the mailman
configuration for this list, so please tag all mails related to High
Availability with this prefix so that it can be detected by both
server-side and client-side mail filters belonging to people
interested in HA discussions.

Thanks!
Adam

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [HA] weekly High Availability meetings on IRC start next Monday

2015-11-10 Thread Adam Spiers
Hi all,

After some discussion in Tokyo by stakeholders in OpenStack High
Availability, I'm pleased to announce that from next Monday we're
starting a series of weekly meetings on IRC.  Details are here:

  https://wiki.openstack.org/wiki/Meetings/HATeamMeeting
  http://eavesdrop.openstack.org/#High_Availability_Meeting

The agenda for the first meeting is set and will focus on

  1. the pros and cons of the existing approaches to hypervisor HA
 which rely on automatic resurrection[0] of VMs, and

  2. how we might be able to converge on a best-of-breed solution.

All are welcome to join!

On a related note, even if you can't attend the meeting, you can still
use the new FreeNode IRC channel #openstack-ha for all HA-related
discussion.

Cheers,
Adam

[0] In the OpenStack community resurrection is commonly referred to
as "evacuation", which is a slightly unfortunate misnomer.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [ANNOUNCE] [HA] new #openstack-ha IRC channel on FreeNode

2015-10-22 Thread Adam Spiers
Sorry!  It would have helped if I'd used the right address for the
openstack list in the To: and Reply-To: headers :-/

Hopefully second time lucky ...

Adam Spiers <aspi...@suse.com> wrote:
> [cross-posting to several lists; please trim the recipients list
> before replying!]
> 
> Hi all,
> 
> After discussion with members of the openstack-infra team, I
> registered a new FreeNode IRC channel, #openstack-ha.  Discussion on all
> aspects of OpenStack High Availability is welcome in this channel.
> Hopefully it will help promote cross-pollination of ideas and maybe
> even more convergence on upstream solutions.
> 
> I added it to https://wiki.openstack.org/wiki/IRC and also set up the
> gerritbot to auto-announce activity for the "new"
> openstack-resource-agents repository which I announced separately
> yesterday:
> 
>   http://lists.openstack.org/pipermail/openstack-dev/2015-October/077601.html
> 
> Still TODO: set up eavesdrop to record channel logs:
> 
>   https://review.openstack.org/#/c/237341/
> 
> Cheers,
> Adam

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [ANNOUNCE] [HA] new #openstack-ha IRC channel on FreeNode

2015-10-22 Thread Adam Spiers
[cross-posting to several lists; please trim the recipients list
before replying!]

Hi all,

After discussion with members of the openstack-infra team, I
registered a new FreeNode IRC channel, #openstack-ha.  Discussion on all
aspects of OpenStack High Availability is welcome in this channel.
Hopefully it will help promote cross-pollination of ideas and maybe
even more convergence on upstream solutions.

I added it to https://wiki.openstack.org/wiki/IRC and also set up the
gerritbot to auto-announce activity for the "new"
openstack-resource-agents repository which I announced separately
yesterday:

  http://lists.openstack.org/pipermail/openstack-dev/2015-October/077601.html

Still TODO: set up eavesdrop to record channel logs:

  https://review.openstack.org/#/c/237341/

Cheers,
Adam

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [ANNOUNCE] [HA] [Pacemaker] new, maintained openstack-resource-agents repository

2015-10-21 Thread Adam Spiers
[cross-posting to openstack-dev and pacemaker user lists; please
consider trimming the recipients list if your reply is not relevant to
both communities]

Hi all,

Back in June I proposed moving the well-used but no longer maintained
https://github.com/madkiss/openstack-resource-agents/ repository to
Stackforge:

  http://lists.openstack.org/pipermail/openstack-dev/2015-June/067763.html
  https://github.com/madkiss/openstack-resource-agents/issues/22

The responses I got were more or less unanimously in favour, so I'm
simultaneously pleased and slightly embarrassed to announce that 4
months later, I've finally followed up on my proposal:

  https://launchpad.net/openstack-resource-agents
  https://git.openstack.org/cgit/openstack/openstack-resource-agents/
  https://review.openstack.org/#/admin/projects/openstack/openstack-resource-agents
  https://review.openstack.org/#/q/status:open+project:openstack/openstack-resource-agents,n,z

Since June, Stackforge has been retired, so as you can see above, this
repository lives under the 'openstack' namespace.

I volunteered to be a maintainer and there were no objections.  I sent
out an initial call for co-maintainers but no one expressed an interest,
which is probably fine because the workload is likely to be quite
light.  However, if you'd like to be involved, please drop me a line.

I've also taken care of outstanding pull requests and bug reports
against the old repository, and provided a redirect from the old
repository's README to the new one.

Still TODO: adding this repository to the Big Tent.  I've had some
discussions with the openstack-infra team about that, since there is
not currently a suitable project team to create it under.  We might
create a new project team called "OpenStack Pacemaker" or similar, and
place it under that.  ("OpenStack HA" would be far too broad to be
able to find a single PTL.)  However there is no rush for this, and it
has been suggested that it would not be a bad thing to wait for the
"new" project to stabilise and prove its longevity before making it
official.

Cheers,
Adam

P.S. I'll be in Tokyo if anyone wants to meet there and discuss
further.

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [HA] RFC: moving Pacemaker openstack-resource-agents to stackforge

2015-06-23 Thread Adam Spiers
[cross-posting to openstack-dev and pacemaker lists; please consider
trimming the recipients list if your reply is not relevant to both
communities]

Hi all,

https://github.com/madkiss/openstack-resource-agents/ is a nice
repository of Pacemaker High Availability resource agents (RAs) for
OpenStack, usage of which has been officially recommended in the
OpenStack High Availability guide.  Here is one of several examples:

  http://docs.openstack.org/high-availability-guide/content/_add_openstack_identity_resource_to_pacemaker.html

Martin Loschwitz, who owns this repository, has since moved away from
OpenStack, and no longer maintains it.  I recently proposed moving the
repository to StackForge, and he gave his consent and in fact said
that he had the same intention but hadn't got round to it:

  https://github.com/madkiss/openstack-resource-agents/issues/22#issuecomment-113386505

You can see from that same github issue that several key members of
the OpenStack Pacemaker sub-community are all in favour.  Therefore
I am volunteering to do the move to StackForge.

Another possibility would be to move each RA to its corresponding
OpenStack project, although this makes a lot less sense to me, since
it would require the core members of every OpenStack project to care
enough about Pacemaker to agree to maintain an RA for it.

This raises the question of maintainership.  SUSE has a vested
interest in these resource agents, so we would be happy to help
maintain them.  I believe Red Hat is also using these, so any
volunteers from there or indeed anywhere else to co-maintain would be
welcome.  They are already fairly complete, and I don't expect there
will be a huge amount of change.

I'm probably getting ahead of myself, but the other big question is
regarding CI.  Currently there are no tests at all.  Of course we
could add bashate, and maybe even some functional tests, but
ultimately some integration tests would be really nice.  However for
now I propose we focus on the move and defer CI work till later.
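
(For anyone unfamiliar with bashate, a first pass could be as simple as
the following; this is just a sketch, and the ocf/ path is an assumption
about where the agent scripts live in the repository, so adjust as needed:

  pip install bashate
  bashate ocf/*    # flags common shell-script style problems in each agent

It wouldn't prove the agents actually work, but it's a cheap baseline
until real functional or integration tests exist.)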

Thoughts?

Thanks!
Adam

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [ClusterLabs] [HA] RFC: moving Pacemaker openstack-resource-agents to stackforge

2015-06-23 Thread Adam Spiers
Digimer <li...@alteeve.ca> wrote:
> Resending to the Cluster Labs mailing list, this list is deprecated

Thanks, I only realised that after getting a deprecation warning :-(

> On 23/06/15 06:27 AM, Adam Spiers wrote:
> > [cross-posting to openstack-dev and pacemaker lists; please consider
> > trimming the recipients list if your reply is not relevant to both
> > communities]
> > 
> > Hi all,
> > 
> > https://github.com/madkiss/openstack-resource-agents/ is a nice
> > repository of Pacemaker High Availability resource agents (RAs) for
> > OpenStack, usage of which has been officially recommended in the
> > OpenStack High Availability guide.  Here is one of several examples:
> > 
> >   http://docs.openstack.org/high-availability-guide/content/_add_openstack_identity_resource_to_pacemaker.html
> > 
> > Martin Loschwitz, who owns this repository, has since moved away from
> > OpenStack, and no longer maintains it.  I recently proposed moving the
> > repository to StackForge, and he gave his consent and in fact said
> > that he had the same intention but hadn't got round to it:
> > 
> >   https://github.com/madkiss/openstack-resource-agents/issues/22#issuecomment-113386505
> > 
> > You can see from that same github issue that several key members of
> > the OpenStack Pacemaker sub-community are all in favour.  Therefore
> > I am volunteering to do the move to StackForge.
> 
> There is a ClusterLabs group on github that most of the HA cluster
> projects have or are moving under. Why not use that?

This question was asked and answered in the github issue:

  https://github.com/madkiss/openstack-resource-agents/issues/22#issuecomment-114147300

Cheers,
Adam

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] tools for making upstreaming / backporting easier in git

2013-09-24 Thread Adam Spiers
Hi all,

Back in April, I created some wrapper scripts around git-cherry(1) and
git-notes(1), which can help when you have more than a trivial number
of commits to upstream or backport from one branch to another.  Since
then I've improved these tools, and also written a higher-level CLI
which should make the whole process pretty easy.

Last week I finally finished a blog post with all the details:

  http://blog.adamspiers.org/2013/09/19/easier-upstreaming-with-git/

in which I demonstrate how to use the tools via an artificial example
involving backporting some commits from Nova's master branch to its
stable/grizzly release branch.
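
For context, the underlying git plumbing which the wrappers build on is
roughly the following (a minimal sketch; the branch names match the
example above, and the SHA placeholder is obviously illustrative):

  # List commits on master which are not yet on stable/grizzly, comparing
  # by patch content rather than by SHA ('-' marks already-ported commits):
  git cherry -v stable/grizzly master

  # Backport a chosen commit, recording the original SHA in its message:
  git checkout stable/grizzly
  git cherry-pick -x <sha-to-backport>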

These tools worked pretty well for me and my team on code outside
OpenStack, but no doubt some people will have ideas on how to improve
them, or have different techniques for tackling the problem.  Either
way, I hope this is of interest, and I'd love to hear what people
think!

Cheers,
Adam

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev