Re: [openstack-dev] [oslo][db] Mysql traditional session mode

2014-01-23 Thread Florian Haas
Ben,

thanks for taking this to the list. Apologies for my brevity and for HTML,
I'm on a moving train and Android Gmail is kinda stupid. :)

On Jan 23, 2014 6:46 PM, Ben Nemec openst...@nemebean.com wrote:

 A while back a change (https://review.openstack.org/#/c/47820/) was made
to allow enabling mysql traditional mode, which tightens up mysql's input
checking to disallow things like silent truncation of strings that exceed
the column's allowed length and invalid dates (as I understand it).

 IMHO, some compelling arguments were made that we should always be using
traditional mode and as such we started logging a warning if it was not
enabled.  It has recently come to my attention (
https://review.openstack.org/#/c/68474/) that not everyone agrees, so I
wanted to bring it to the list to get as wide an audience for the
discussion as possible and hopefully come to a consensus so we don't end up
having this discussion every few months.

For the record, I obviously am all in favor of avoiding data corruption,
although it seems not everyone agrees that TRADITIONAL is necessarily the
preferable mode. But that aside, if Oslo decides that any particular mode
is required, it should just go ahead and set it, rather than log a warning
that the user can't possibly fix.

 I remain of the opinion that traditional mode is a good thing and we
_should_ be enabling it.  I would call silent truncation and bogus date
values bugs that should be fixed, but maybe there are other implications of
this mode that I'm not aware of.

 It was also pointed out that the warning is logged even if the user
forces traditional mode through my.cnf.  While this certainly solves the
underlying problem, it doesn't change the fact that the application was
trying to do something bad.  We tried to make it clear in the log message
that this is a developer problem and the user needs to pester the developer
to enable the mode, but maybe there's more discussion that needs to go on
there as well.

Hence my proposal to make this a config option and actually set the mode on
connect. To make the patch as un-invasive as possible, the default for that
option is currently empty, but if it seems prudent to set TRADITIONAL or
STRICT_ALL_TABLES instead, I'll be happy to fix the patch up accordingly.
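
To make "set the mode on connect" concrete, here is a minimal sketch. It is
not the code in the patch under review: apply_sql_mode and FakeCursor are
names invented for illustration, and the "%s" placeholder assumes a driver
with the DB-API "format" paramstyle. Only the SET SESSION sql_mode statement
itself is standard MySQL.

```python
def apply_sql_mode(cursor, sql_mode):
    """Set the session SQL mode on a freshly opened connection.

    An empty or None sql_mode leaves the server default untouched, which
    mirrors an option whose default is empty. Returns True if a statement
    was issued.
    """
    if not sql_mode:
        return False
    # DB-API style parameter binding; the driver quotes the value.
    cursor.execute("SET SESSION sql_mode = %s", (sql_mode,))
    return True


class FakeCursor:
    """Stand-in for a DB-API cursor that only records statements (hypothetical)."""

    def __init__(self):
        self.statements = []

    def execute(self, statement, params=()):
        self.statements.append((statement, tuple(params)))
```

With an empty option nothing is sent to the server; with "TRADITIONAL" the
statement is issued once per connection, so the mode no longer depends on
whatever my.cnf happens to say.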

Cheers,
Florian
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [oslo][db] Mysql traditional session mode

2014-01-24 Thread Florian Haas
On Thu, Jan 23, 2014 at 7:22 PM, Ben Nemec openst...@nemebean.com wrote:
 On 2014-01-23 12:03, Florian Haas wrote:

 Ben,

 thanks for taking this to the list. Apologies for my brevity and for HTML,
 I'm on a moving train and Android Gmail is kinda stupid. :)

 I have some experience with the quirks of phone GMail myself. :-)

 On Jan 23, 2014 6:46 PM, Ben Nemec openst...@nemebean.com wrote:

 A while back a change (https://review.openstack.org/#/c/47820/) was made
 to allow enabling mysql traditional mode, which tightens up mysql's input
 checking to disallow things like silent truncation of strings that exceed
 the column's allowed length and invalid dates (as I understand it).

 IMHO, some compelling arguments were made that we should always be using
 traditional mode and as such we started logging a warning if it was not
 enabled.  It has recently come to my attention
 (https://review.openstack.org/#/c/68474/) that not everyone agrees, so I
 wanted to bring it to the list to get as wide an audience for the discussion
 as possible and hopefully come to a consensus so we don't end up having this
 discussion every few months.

 For the record, I obviously am all in favor of avoiding data corruption,
 although it seems not everyone agrees that TRADITIONAL is necessarily the
 preferable mode. But that aside, if Oslo decides that any particular mode is
 required, it should just go ahead and set it, rather than log a warning that
 the user can't possibly fix.


 Honestly, defaulting it to enabled was my preference in the first place.  I
 got significant pushback though because it might break consuming
 applications that do the bad things traditional mode prevents.

Wait. So the reasoning behind the pushback was that an INSERT that
shreds data is better than an INSERT that fails? Really?

 My theory
 was that we could default it to off, log the warning, get all the projects
 to enable it as they can, and then flip the default to enabled.  Obviously
 that hasn't all happened though. :-)

Wouldn't you think it's a much better approach to enable whatever mode
is deemed appropriate, and have malformed INSERTs (rightfully) break?
Isn't that a much stronger incentive to actually fix broken code?

The oslo tests do include a unit test for this, jftr, checking for an
error to be raised when a 512-byte string is inserted into a 255-byte
column.
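
To illustrate the difference that test is about, here is a pure-Python
simulation. It is not the oslo unit test and it does not talk to MySQL;
store and DataTooLong are hypothetical names, and lengths are counted in
characters for simplicity.

```python
class DataTooLong(Exception):
    """Raised when a value does not fit the column (hypothetical name)."""


def store(value, max_len, strict):
    """Simulate writing a string into a column holding max_len characters."""
    if len(value) <= max_len:
        return value
    if strict:
        # Strict behaviour: the write is rejected and nothing is stored.
        raise DataTooLong(
            "value of length %d exceeds column length %d" % (len(value), max_len)
        )
    # Non-strict behaviour: the value is silently cut to fit.
    return value[:max_len]
```

In the non-strict case the caller gets no indication that more than half of
a 512-character value was dropped; in the strict case the failure is
immediate and visible.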

 Hence my proposal to make this a config option. To make the patch as
 un-invasive as possible, the default for that option is currently empty, but
 if it seems prudent to set TRADITIONAL or STRICT_ALL_TABLES instead, I'll be
 happy to fix the patch up accordingly.

 Also check out Jay's reply.  It sounds like there are some improvements we
 can make as far as not logging the message when the user enables traditional
 mode globally.

And then when INSERTs break, it will be much more difficult for an
application developer to figure out the problem, because the breakage
would happen based on a configuration setting outside the codebase,
and hence beyond the developer's control. I really don't like that
idea. All this leads to is bugs being filed and then closed with a
simple "can't reproduce."

 I'm still not clear on whether there is a need for the STRICT_* modes, and
 if there is we should probably also allow STRICT_TRANS_TABLES since that
 appears to be part of strict mode in MySQL.  In fact, if we're going to
 allow arbitrary modes, we may need a more flexible config option - it looks
 like there are a bunch of possible sql_modes available for people who don't
 want the blanket "disallow all the things" mode.

Fair enough, I can remove the choices arg for the StrOpt, if that's
what you suggest. My concern was about unsanitized user input. Your
inline comment on my patch seems to indicate that we should instead
trust sqla to do input sanitization properly.
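
If the choices list goes away, one middle ground would be a purely
syntactic check on the option value. The sketch below is an assumption of
mine, not something in the patch: normalize_sql_mode is a hypothetical
helper that accepts any comma-separated list of identifier-like tokens
without hard-coding which modes MySQL knows about.

```python
import re

# A mode name is assumed to be letters, digits and underscores only.
_MODE_TOKEN = re.compile(r"[A-Z][A-Z0-9_]*")


def normalize_sql_mode(raw):
    """Return a canonical comma-separated mode string or raise ValueError."""
    tokens = [t.strip().upper() for t in raw.split(",") if t.strip()]
    for token in tokens:
        if not _MODE_TOKEN.fullmatch(token):
            raise ValueError("invalid sql_mode token: %r" % token)
    return ",".join(tokens)
```

This keeps arbitrary mode combinations possible while rejecting anything
that contains quotes, semicolons or whitespace inside a token.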

I still maintain that leaving $insert_mode_here mode off and logging a
warning is silly. If it's necessary, turn it on and have borked
INSERTs fail. If I understand the situation correctly, they would fail
anyway the moment someone switches to, say, Postgres.

Cheers,
Florian



Re: [openstack-dev] [oslo][db] Mysql traditional session mode

2014-01-24 Thread Florian Haas
On Fri, Jan 24, 2014 at 4:30 PM, Doug Hellmann
doug.hellm...@dreamhost.com wrote:



 On Fri, Jan 24, 2014 at 3:29 AM, Florian Haas flor...@hastexo.com wrote:

 On Thu, Jan 23, 2014 at 7:22 PM, Ben Nemec openst...@nemebean.com wrote:
  On 2014-01-23 12:03, Florian Haas wrote:
 
  Ben,
 
  thanks for taking this to the list. Apologies for my brevity and for
  HTML,
  I'm on a moving train and Android Gmail is kinda stupid. :)
 
  I have some experience with the quirks of phone GMail myself. :-)
 
  On Jan 23, 2014 6:46 PM, Ben Nemec openst...@nemebean.com wrote:
 
  A while back a change (https://review.openstack.org/#/c/47820/) was
  made
  to allow enabling mysql traditional mode, which tightens up mysql's
  input
  checking to disallow things like silent truncation of strings that
  exceed
  the column's allowed length and invalid dates (as I understand it).
 
  IMHO, some compelling arguments were made that we should always be
  using
  traditional mode and as such we started logging a warning if it was not
  enabled.  It has recently come to my attention
  (https://review.openstack.org/#/c/68474/) that not everyone agrees, so
  I
  wanted to bring it to the list to get as wide an audience for the
  discussion
  as possible and hopefully come to a consensus so we don't end up having
  this
  discussion every few months.
 
  For the record, I obviously am all in favor of avoiding data corruption,
  although it seems not everyone agrees that TRADITIONAL is necessarily
  the
  preferable mode. But that aside, if Oslo decides that any particular
  mode is
  required, it should just go ahead and set it, rather than log a warning
  that
  the user can't possibly fix.
 
 
  Honestly, defaulting it to enabled was my preference in the first place.
  I
  got significant pushback though because it might break consuming
  applications that do the bad things traditional mode prevents.

 Wait. So the reasoning behind the pushback was that an INSERT that
 shreds data is better than an INSERT that fails? Really?

  My theory
  was that we could default it to off, log the warning, get all the
  projects
  to enable it as they can, and then flip the default to enabled.
  Obviously
  that hasn't all happened though. :-)

 Wouldn't you think it's a much better approach to enable whatever mode
 is deemed appropriate, and have malformed INSERTs (rightfully) break?
 Isn't that a much stronger incentive to actually fix broken code?

 The oslo tests do include a unit test for this, jftr, checking for an
 error to be raised when a 512-byte string is inserted into a 255-byte
 column.

  Hence my proposal to make this a config option. To make the patch as
  un-invasive as possible, the default for that option is currently empty,
  but
  if it seems prudent to set TRADITIONAL or STRICT_ALL_TABLES instead,
  I'll be
  happy to fix the patch up accordingly.
 
  Also check out Jay's reply.  It sounds like there are some improvements
  we
  can make as far as not logging the message when the user enables
  traditional
  mode globally.

 And then when INSERTs break, it will be much more difficult for an
 application developer to figure out the problem, because the breakage
 would happen based on a configuration setting outside the codebase,
 and hence beyond the developer's control. I really don't like that
 idea. All this leads to is bugs being filed and then closed with a
 simple "can't reproduce."

  I'm still not clear on whether there is a need for the STRICT_* modes,
  and
  if there is we should probably also allow STRICT_TRANS_TABLES since that
  appears to be part of strict mode in MySQL.  In fact, if we're going
  to
  allow arbitrary modes, we may need a more flexible config option - it
  looks
  like there are a bunch of possible sql_modes available for people who
  don't
  want the blanket "disallow all the things" mode.

 Fair enough, I can remove the choices arg for the StrOpt, if that's
 what you suggest. My concern was about unsanitized user input. Your
 inline comment on my patch seems to indicate that we should instead
 trust sqla to do input sanitization properly.

 I still maintain that leaving $insert_mode_here mode off and logging a
 warning is silly. If it's necessary, turn it on and have borked
 INSERTs fail. If I understand the situation correctly, they would fail
 anyway the moment someone switches to, say, Postgres.


 +1

 Doug

Updated patch set:
https://review.openstack.org/#/q/status:open+project:openstack/oslo-incubator+branch:master+topic:bug-1271706,n,z

Cheers,
Florian



[openstack-dev] [ceilometer] Exposing Ceilometer alarms as SNMP traps

2014-04-24 Thread Florian Haas
Hello everyone,

I'd just like to throw something out there for discussion. Please note
that I've CC'd the operators list to reach a wider audience (including
the would-be users of the feature I'm about to discuss), but this is
rather firmly a development issue, so it would be great if we could
keep the responses exclusively on the -dev list.

I've been talking to OpenStack users in the telco and NFV space a lot
as of late, and one of the things that always comes up is reporting
abnormal events from OpenStack into an SNMP based network management
system. Ceilometer is the obvious place to plug into, and with the
help of Eoghan and Julien in
https://plus.google.com/u/0/+FlorianHaas/posts/9BC46ozA8T3 I've come
to the realization that writing an alarm notifier that bridges into
pysnmp would be relatively straightforward. Effectively one would
extend ceilometer.alarm.notifier.AlarmNotifier, and use a Notification
Originator as described in
http://pysnmp.sourceforge.net/examples/current/v3arch/oneliner/agent/ntforg/trap-v2c-with-mib-lookup.html
to send the trap.

The tricky part is of course to decide which MIB one should support.
Creating Ceilometer's own MIB sounds like it would be disadvantageous
versus using a publicly known MIB that already exists for the purpose.
Doing some cursory research I've come across RFC 3877
(https://tools.ietf.org/html/rfc3877) which defines a generic Alarm
MIB. My SNMP experience, however, is way too limited to decide
whether or not that is actually widely used and would be suitable.

There are interesting side issues here, by the way, such as the fact
that Ceilometer alarms currently have no concept of severity, which is
somewhat crucial to the RFC 3877 Alarm model (and presumably, also for
other alarm use cases). But that's separate from the SNMP discussion.

If anyone with greater SNMP and NMS experience than my own could share
thoughts here, that would be great. In addition, if someone is
interested in doing a Design Summit session or a BoF on this subject
in Atlanta, or even just meet informally and discuss ideas, please let
me know. Thank you!

Cheers,
Florian



Re: [openstack-dev] [ceilometer] Exposing Ceilometer alarms as SNMP traps

2014-04-24 Thread Florian Haas
[Dropping -operators from CC list]

On Thu, Apr 24, 2014 at 2:55 PM, Julien Danjou jul...@danjou.info wrote:
 On Thu, Apr 24 2014, Florian Haas wrote:

 There are interesting side issues here, by the way, such as the fact
 that Ceilometer alarms currently have no concept of severity, which is
 somewhat crucial to the RFC 3877 Alarm model (and presumably, also for
 other alarm use cases). But that's separate from the SNMP discussion.

 This is actually not a problem. You could specify a severity and various
 parameters as part as the custom SNMP URL you would specify in the alarm
 action field.

And down the rabbit hole we go. :)

Currently, AlarmNotifier says:

    def notify(self, action, alarm_id, previous, current, reason, reason_data):
        """Notify that an alarm has been triggered.

        :param action: The action that is being attended, as a parsed URL.
        [...]
        """

So for any inheriting subclass, the notify method signature is defined
such that action needs to be a URL. That doesn't make a whole lot of
sense for anything other than a RESTful service. If we want to map
those to SNMP URIs, then there's RFC 4088 that describes that. But
those URIs, to the best of my knowledge, can't be used for traps.

Maybe it's time to define action more broadly at the abstract
superclass level? And let implementing notifiers set their own
specific requirements on its format?
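
As a rough sketch of what that could look like: the classes below are
simplified stand-ins, not Ceilometer's actual AlarmNotifier, and
parse_action, RecordingTrapNotifier and the dict-shaped action are all
assumptions made up for illustration.

```python
import abc


class AlarmNotifier(abc.ABC):
    """Simplified stand-in for the abstract notifier (not Ceilometer's class)."""

    @abc.abstractmethod
    def parse_action(self, action):
        """Validate and decode the notifier-specific action value."""

    @abc.abstractmethod
    def notify(self, action, alarm_id, previous, current, reason, reason_data):
        """Notify that an alarm has been triggered."""


class RecordingTrapNotifier(AlarmNotifier):
    """Hypothetical notifier whose action is a mapping rather than a URL."""

    def __init__(self):
        self.sent = []

    def parse_action(self, action):
        if not isinstance(action, dict) or "target" not in action:
            raise ValueError("action must be a mapping with a 'target' key")
        return action["target"], action.get("severity", "indeterminate")

    def notify(self, action, alarm_id, previous, current, reason, reason_data):
        target, severity = self.parse_action(action)
        # A real notifier would hand this to an SNMP library here.
        self.sent.append((target, severity, alarm_id, current))
```

The superclass only promises that an action exists; each notifier decides
what a valid action looks like and rejects anything else.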

Cheers,
Florian



Re: [openstack-dev] [ceilometer] Exposing Ceilometer alarms as SNMP traps

2014-04-24 Thread Florian Haas
On Thu, Apr 24, 2014 at 4:20 PM, Julien Danjou jul...@danjou.info wrote:
 On Thu, Apr 24 2014, Florian Haas wrote:

 So for any inheriting subclass, the notify method signature is defined
 such that action needs to be a URL. That doesn't make a whole lot of
 sense for anything other than a RESTful service. If we want to map
 those to SNMP URIs, then there's RFC 4088 that describes that. But
 those URIs, to the best of my knowledge, can't be used for traps.

 Actually you can use anything with URL, we could use something like:

 snmptrap://destination/oid?community=public&urgency=high

 And that would do it.
 (not sure about the parameters and all, I'm no SNMP trap connoisseur, you
 get the idea)

But that would be another case of wheel reinvention. To me the idea to
express an SNMP trap as a URI sounds rather ludicrous to begin with;
it doesn't get any more reasonable by *not* using the scheme that
someone else has already invented, and instead inventing one's own.

What seems stranger to me in the first place is the requirement that a
generic event action be a URL.
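
For reference, this is roughly what a notifier would have to do with the
ad-hoc URL from the quoted example (written here with an "&" between the
query parameters). The snmptrap scheme and the community/urgency parameter
names come from that example and are not an existing standard;
parse_trap_url is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlsplit


def parse_trap_url(url):
    """Split an ad-hoc snmptrap:// URL into destination, OID and parameters."""
    parts = urlsplit(url)
    if parts.scheme != "snmptrap":
        raise ValueError("unsupported scheme: %r" % parts.scheme)
    # parse_qs returns a list per key; keep only the first value.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "destination": parts.hostname,
        "oid": parts.path.lstrip("/"),
        "params": params,
    }
```

Every parameter beyond host and path ends up as free-form query data, which
is exactly the part that each notifier would have to define and document
for itself.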

What do others think?

Cheers,
Florian



Re: [openstack-dev] [ceilometer] Exposing Ceilometer alarms as SNMP traps

2014-04-25 Thread Florian Haas
Hi Eric,

On Thu, Apr 24, 2014 at 7:02 PM, Eric Brown bro...@vmware.com wrote:
 I'm pretty familiar with SNMP as I have worked with it for a number of years.
 I know Telcos like it, but I feel it's a protocol that is near end of life.
 It hasn't kept up on security guidelines.  SNMPv1 and v2c are totally insecure
 and SNMPv3 is barely usable.  But even SNMPv3 still uses MD5 and SHA1.

I agree, but at least with my limited SNMP experience I've seen quite
a few v2c deployments out there, so forgoing that altogether doesn't
seem like a good idea to me.

 That being said, the Alarm MIB would be my choice of MIB.  A custom MIB
 would be a mess and a nightmare to maintain.

Thanks for confirming. :)

 Can pysnmp do v3 notifications?  You might want to also consider informs
 rather than traps since they are acknowledged.

Yes, pysnmp can do INFORMs:
http://pysnmp.sourceforge.net/examples/current/v3arch/oneliner/agent/ntforg/inform-v3.html

However, speaking of acknowledgments, is the concept of an alert being
acknowledged even present in Ceilometer?

I'm afraid I've opened a can of worms here. :)

Cheers,
Florian



Re: [openstack-dev] [Nova] Automatic evacuate

2014-10-15 Thread Florian Haas
On Wed, Oct 15, 2014 at 7:20 PM, Russell Bryant rbry...@redhat.com wrote:
 On 10/13/2014 05:59 PM, Russell Bryant wrote:
 Nice timing.  I was working on a blog post on this topic.

 which is now here:

 http://blog.russellbryant.net/2014/10/15/openstack-instance-ha-proposal/

I am absolutely loving the fact that we are finally having a
discussion in earnest about this. I think this deserves a Design
Summit session.

If I may weigh in here, let me share what I've seen users do and what
can currently be done, and what may be supported in the future.

Problem: automatically ensure that a Nova guest continues to run, even
if its host fails.

(That's the general problem description and I don't need to go into
further details explaining the problem, because Russell has done that
beautifully in his blog post.)

Now, what are the options?

(1) Punt and leave it to the hypervisor.

This essentially means that you must use a hypervisor that already has
HA built in, such as VMware with the vCenter driver. In that scenario,
Nova itself neither deals with HA, nor exposes any HA switches to the
user. Obvious downside: not generic, doesn't work with all
hypervisors, most importantly doesn't work with the most popular one
(libvirt/KVM).

(2) Deploy Nova nodes in pairs/groups, and pretend that they are one node.

You can already do that by overriding "host" in nova-compute.conf,
setting resume_guests_state_on_host_boot, and using VIPs with
Corosync/Pacemaker. You can then group these hosts in host aggregates,
and the user's scheduler hint to point a newly scheduled guest to such
a host aggregate becomes, effectively, the "keep this guest running at
all times" flag. Upside: no changes to Nova at all, monitoring,
fencing and recovery for free from Corosync/Pacemaker. Downsides:
requires vendors to automate Pacemaker configuration in deployment
tools (because you really don't want to do those things manually).
Additional downside: you either have some idle hardware, or you might
be overcommitting resources in case of failover.

(3) Automatic host evacuation.

Not supported in Nova right now, as Adam pointed out at the top of the
thread, and repeatedly shot down. If someone were to implement this,
it would *still* require that Corosync/Pacemaker be used for
monitoring and fencing of nodes, because re-implementing this from
scratch would be the reinvention of a wheel while painting a bikeshed.

(4) Per-guest HA.

This is the idea of just doing "nova boot --keep-this running", i.e.
setting a per-guest flag that still means the machine is to be kept up
at all times. Again, not supported in Nova right now, and probably
even more complex to implement generically than (3), at the same or
greater cost.

I have a suggestion to tackle this that I *think* is reasonably
user-friendly while still bearable in terms of Nova development
effort:

(a) Define a well-known metadata key for a host aggregate, say "ha".
Define that any host aggregate that represents a highly available
group of compute nodes should have this metadata key set.

(b) Then define a flavor that sets extra_specs ha=true.

Granted, this places an additional burden on distro vendors to
integrate highly-available compute nodes into their deployment
infrastructure. But since practically all of them already include
Pacemaker, the additional scaffolding required is actually rather
limited.

Am I making sense?

Cheers,
Florian



Re: [openstack-dev] [Nova] Automatic evacuate

2014-10-15 Thread Florian Haas
On Wed, Oct 15, 2014 at 9:58 PM, Jay Pipes jaypi...@gmail.com wrote:
 On 10/15/2014 03:16 PM, Florian Haas wrote:

 On Wed, Oct 15, 2014 at 7:20 PM, Russell Bryant rbry...@redhat.com
 wrote:

 On 10/13/2014 05:59 PM, Russell Bryant wrote:

 Nice timing.  I was working on a blog post on this topic.


 which is now here:

 http://blog.russellbryant.net/2014/10/15/openstack-instance-ha-proposal/


 I am absolutely loving the fact that we are finally having a
 discussion in earnest about this. I think this deserves a Design
 Summit session.

 If I may weigh in here, let me share what I've seen users do and what
 can currently be done, and what may be supported in the future.

 Problem: automatically ensure that a Nova guest continues to run, even
 if its host fails.

 (That's the general problem description and I don't need to go into
 further details explaining the problem, because Russell has done that
 beautifully in his blog post.)

 Now, what are the options?

 (1) Punt and leave it to the hypervisor.

 This essentially means that you must use a hypervisor that already has
 HA built in, such as VMware with the vCenter driver. In that scenario,
 Nova itself neither deals with HA, nor exposes any HA switches to the
 user. Obvious downside: not generic, doesn't work with all
 hypervisors, most importantly doesn't work with the most popular one
 (libvirt/KVM).

 (2) Deploy Nova nodes in pairs/groups, and pretend that they are one node.

 You can already do that by overriding "host" in nova-compute.conf,
 setting resume_guests_state_on_host_boot, and using VIPs with
 Corosync/Pacemaker. You can then group these hosts in host aggregates,
 and the user's scheduler hint to point a newly scheduled guest to such
 a host aggregate becomes, effectively, the "keep this guest running at
 all times" flag. Upside: no changes to Nova at all, monitoring,
 fencing and recovery for free from Corosync/Pacemaker. Downsides:
 requires vendors to automate Pacemaker configuration in deployment
 tools (because you really don't want to do those things manually).
 Additional downside: you either have some idle hardware, or you might
 be overcommitting resources in case of failover.

 (3) Automatic host evacuation.

 Not supported in Nova right now, as Adam pointed out at the top of the
 thread, and repeatedly shot down. If someone were to implement this,
 it would *still* require that Corosync/Pacemaker be used for
 monitoring and fencing of nodes, because re-implementing this from
 scratch would be the reinvention of a wheel while painting a bikeshed.

 (4) Per-guest HA.

 This is the idea of just doing "nova boot --keep-this running", i.e.
 setting a per-guest flag that still means the machine is to be kept up
 at all times. Again, not supported in Nova right now, and probably
 even more complex to implement generically than (3), at the same or
 greater cost.

 I have a suggestion to tackle this that I *think* is reasonably
 user-friendly while still bearable in terms of Nova development
 effort:

 (a) Define a well-known metadata key for a host aggregate, say "ha".
 Define that any host aggregate that represents a highly available
 group of compute nodes should have this metadata key set.

 (b) Then define a flavor that sets extra_specs ha=true.

 Granted, this places an additional burden on distro vendors to
 integrate highly-available compute nodes into their deployment
 infrastructure. But since practically all of them already include
 Pacemaker, the additional scaffolding required is actually rather
 limited.


 Or:

 (5) Let monitoring and orchestration services deal with these use cases and
 have Nova simply provide the primitive API calls that it already does (i.e.
 host evacuate).

That would arguably lead to an incredible amount of wheel reinvention
for node failure detection, service failure detection, etc. etc.

Florian



Re: [openstack-dev] [Nova] Automatic evacuate

2014-10-15 Thread Florian Haas
On Wed, Oct 15, 2014 at 10:03 PM, Russell Bryant rbry...@redhat.com wrote:
 Am I making sense?

 Yep, the downside is just that you need to provide a new set of flavors
 for ha vs non-ha.  A benefit though is that it's a way to support it
 today without *any* changes to OpenStack.

Users are already very used to defining new flavors. Nova itself
wouldn't even need to define those; if the vendor's deployment tools
defined them it would be just fine.

 This seems like the kind of thing we should also figure out how to offer
 on a per-guest basis without needing a new set of flavors.  That's why I
 also listed the server tagging functionality as another possible solution.

This still doesn't do away with the requirement to reliably detect
node failure, and to fence misbehaving nodes. Detecting that a node
has failed, and fencing it if unsure, is a prerequisite for any
recovery action. So you need Corosync/Pacemaker anyway.
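
The ordering matters more than the mechanism. As a sketch of it, with
is_alive, fence and evacuate as injected, hypothetical callables (in
practice this is what the cluster stack does for you, not something to
hand-roll):

```python
def recover_node(node, is_alive, fence, evacuate):
    """Recover guests from a node, fencing it first if it looks dead.

    Returns the list of steps taken, in order.
    """
    steps = []
    if is_alive(node):
        return steps
    # Never restart guests elsewhere until the node is known to be off;
    # otherwise two copies of a guest could write to the same storage.
    if not fence(node):
        raise RuntimeError("fencing %s failed; refusing to recover" % node)
    steps.append("fenced")
    evacuate(node)
    steps.append("evacuated")
    return steps
```

If fencing cannot be confirmed, no recovery happens at all, regardless of
whether the guest was selected by flavor, tag or anything else.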

Note also that when using an approach where you have physically
clustered nodes, but you are also running non-HA VMs on those, then
the user must understand that the following applies:

(1) If your guest is marked HA, then it will automatically recover on
node failure, but
(2) if your guest is *not* marked HA, then it will go down with the
node not only if it fails, but also if it is fenced.

So a non-HA guest on an HA node group actually has a slightly
*greater* chance of going down than a non-HA guest on a non-HA host.
(And let's not get into "don't use fencing then"; we all know why
that's a bad idea.)

Which is why I think it makes sense to just distinguish between
HA-capable and non-HA-capable hosts, and have the user decide whether
they want HA or non-HA guests simply by assigning them to the
appropriate host aggregates.
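
The cases above can be written down as a small truth function. This is only
a restatement of the argument in code; guest_stays_down and the two event
names are made up for illustration.

```python
def guest_stays_down(guest_is_ha, host_in_ha_group, event):
    """Return True if the guest is lost when the given event hits its node.

    event is "node_failed" or "node_fenced".
    """
    if event not in ("node_failed", "node_fenced"):
        raise ValueError("unknown event: %r" % event)
    if event == "node_fenced" and not host_in_ha_group:
        # Standalone hosts are never fenced by a cluster manager.
        return False
    # The node is gone; only HA guests on HA node groups are restarted.
    return not (guest_is_ha and host_in_ha_group)
```

The non-HA guest on an HA node group is the only combination that is
exposed to both events without being recovered from either.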

Cheers,
Florian



Re: [openstack-dev] [Nova] Automatic evacuate

2014-10-16 Thread Florian Haas
 (5) Let monitoring and orchestration services deal with these use
 cases and
 have Nova simply provide the primitive API calls that it already does
 (i.e.
 host evacuate).

 That would arguably lead to an incredible amount of wheel reinvention
 for node failure detection, service failure detection, etc. etc.

 How so? (5) would use existing wheels for monitoring and orchestration
 instead of writing all new code paths inside Nova to do the same thing.

 Right, there may be some confusion here ... I thought you were both
 agreeing that the use of an external toolset was a good approach for the
 problem, but Florian's last message makes that not so clear ...

While one of us (Jay or me) speaking for the other and saying we agree
is a distributed consensus problem that dwarfs the complexity of
Paxos, *I* for my part do think that an external toolset (i.e. one
that lives outside the Nova codebase) is the better approach versus
duplicating the functionality of said toolset in Nova.

I just believe that the toolset that should be used here is
Corosync/Pacemaker and not Ceilometer/Heat. And I believe the former
approach leads to *far* fewer necessary code changes *in* Nova than
the latter.

Cheers,
Florian



Re: [openstack-dev] [Nova] Automatic evacuate

2014-10-16 Thread Florian Haas
On Thu, Oct 16, 2014 at 5:04 AM, Russell Bryant rbry...@redhat.com wrote:
 On 10/15/2014 05:07 PM, Florian Haas wrote:
 On Wed, Oct 15, 2014 at 10:03 PM, Russell Bryant rbry...@redhat.com wrote:
 Am I making sense?

 Yep, the downside is just that you need to provide a new set of flavors
 for ha vs non-ha.  A benefit though is that it's a way to support it
 today without *any* changes to OpenStack.

 Users are already very used to defining new flavors. Nova itself
 wouldn't even need to define those; if the vendor's deployment tools
 defined them it would be just fine.

 Yes, I know Nova wouldn't need to define it.  I was saying I didn't like
 that it was required at all.

Fair enough, but do consider that, for example, Trove already
routinely defines flavors of its own.

So I don't think that's quite as painful (to users) as you think.

 This seems like the kind of thing we should also figure out how to offer
 on a per-guest basis without needing a new set of flavors.  That's why I
 also listed the server tagging functionality as another possible solution.

 This still doesn't do away with the requirement to reliably detect
 node failure, and to fence misbehaving nodes. Detecting that a node
 has failed, and fencing it if unsure, is a prerequisite for any
 recovery action. So you need Corosync/Pacemaker anyway.

 Obviously, yes.  My post covered all of that directly ... the tagging
 bit was just additional input into the recovery operation.

This is essentially why I am saying that using the Pacemaker stack is a
smarter approach than hacking something into Ceilometer and Heat. You
already need Pacemaker for service availability (and all major vendors
have adopted it for that purpose), so a highly available cloud that
does *not* use Pacemaker at all won't be a vendor-supported option for
some time. So people will already be running Pacemaker; why not
use it for what it's good at?

(Yes, I am aware of things like etcd and fleet. I think that's headed
in the right direction, but hasn't nearly achieved the degree of
maturity that Pacemaker has. All of HA is about performing correctly
in weird corner cases, and you're only able to do that if you've run
into them and got your nose bloody.)

And just so my position is clear, what Pacemaker is good at is node
and service monitoring, recovery, and fencing. It's *not* particularly
good at usability. Which is why it makes perfect sense to not have
your Pacemaker configurations managed directly by a human, but have an
automated deployment facility do it. Which the vendors are already
doing.

 Note also that when using an approach where you have physically
 clustered nodes, but you are also running non-HA VMs on those, then
 the user must understand that the following applies:

 (1) If your guest is marked HA, then it will automatically recover on
 node failure, but
 (2) if your guest is *not* marked HA, then it will go down with the
 node not only if it fails, but also if it is fenced.

 So a non-HA guest on an HA node group actually has a slightly
 *greater* chance of going down than a non-HA guest on a non-HA host.
 (And let's not get into "don't use fencing then"; we all know why
 that's a bad idea.)

 Which is why I think it makes sense to just distinguish between
 HA-capable and non-HA-capable hosts, and have the user decide whether
 they want HA or non-HA guests simply by assigning them to the
 appropriate host aggregates.

 Very good point.  I hadn't considered that.

Yay, I've contributed something useful to this discussion then. :)

Cheers,
Florian

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Nova] Automatic evacuate

2014-10-16 Thread Florian Haas
On Thu, Oct 16, 2014 at 9:25 AM, Jastrzebski, Michal
michal.jastrzeb...@intel.com wrote:
 In my opinion flavor defining is a bit hacky. Sure, it will provide us
 functionality fairly quickly, but also will strip us from flexibility Heat
 would give. Healing can be done in several ways, simple destroy - create
 (basic convergence workflow so far), evacuate with or without
 shared storage, even rebuild vm, probably few more when we put more thoughts
 to it.

But then you'd also need to monitor the availability of *individual*
guests, and down the rabbit hole you go.

So suppose you're monitoring a guest with a simple ping. And it stops
responding to that ping.

(1) Has it died?
(2) Is it just too busy to respond to the ping?
(3) Has its guest network stack died?
(4) Has its host vif died?
(5) Has the L2 agent on the compute host died?
(6) Has its host network stack died?
(7) Has the compute host died?

Suppose further it's using shared storage (running off an RBD volume
or using an iSCSI volume, or whatever). Now you have almost as many
recovery options as possible causes for the failure, and some of those
recovery options will potentially destroy your guest's data.

No matter how you twist and turn the problem, you need strongly
consistent distributed VM state plus fencing. In other words, you need
a full blown HA stack.
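
To make the ambiguity concrete, here is a minimal sketch. It is not anything that exists in Nova or Pacemaker: the `Cause` type, the `guest_may_still_run` flag and both function names are made up for illustration, and the true/false values are a simplification of the list above. A cause counts as unsafe for "restart the guest elsewhere" whenever the original guest might still be running and writing to shared storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cause:
    """One possible explanation for a missed ping (hypothetical type)."""
    description: str
    guest_may_still_run: bool  # assumption made for this sketch only


# The seven candidate causes from the list above. A monitor that only
# sees "no ping reply" cannot tell these apart.
CAUSES = [
    Cause("guest has died", False),
    Cause("guest is too busy to respond", True),
    Cause("guest network stack has died", True),
    Cause("host vif has died", True),
    Cause("L2 agent on the compute host has died", True),
    Cause("host network stack has died", True),
    Cause("compute host has died", False),
]


def restart_elsewhere_is_safe(cause: Cause, host_fenced: bool) -> bool:
    """Starting a second copy is safe only if the first cannot be running."""
    return host_fenced or not cause.guest_may_still_run


def unsafe_causes(host_fenced: bool) -> list:
    """Causes under which restarting elsewhere risks two writers."""
    return [c.description for c in CAUSES
            if not restart_elsewhere_is_safe(c, host_fenced)]


if __name__ == "__main__":
    print(len(unsafe_causes(host_fenced=False)), "unsafe causes without fencing")
    print(len(unsafe_causes(host_fenced=True)), "unsafe causes with fencing")
```

Under these assumptions, most of the candidate causes leave open the possibility of two copies of the guest writing to the same volume unless the host has been fenced first.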

 I'd rather use nova for low-level tasks and maybe low-level monitoring (imho
 nova should do that using servicegroup). But I'd use something more
 configurable for actual task triggering, like heat. That would give us a
 framework rather than a mechanism. Later we might want to apply HA to networks
 or volumes; then we'll have the mechanism ready, and only the monitoring hook
 and healing will need to be implemented.

 We can use scheduler hints to place resources on an HA-compatible host
 (whichever health action we'd like to use). This will be a bit more
 complicated, but will also give us more flexibility.

I apologize in advance for my bluntness, but this all sounds to me
like you're vastly underrating the problem of reliable guest state
detection and recovery. :)

 I agree that we all should meet in Paris and discuss that so we can join our
 forces. This is one of bigger gaps to be filled imho.

Pretty much every user I've worked with in the last 2 years agrees.
Granted, my view may be skewed as HA is typically what customers
approach us for in the first place, but yes, this definitely needs a
globally understood and supported solution.

Cheers,
Florian



Re: [openstack-dev] [Nova] Automatic evacuate

2014-10-16 Thread Florian Haas
On Thu, Oct 16, 2014 at 11:01 AM, Thomas Herve
thomas.he...@enovance.com wrote:

  This still doesn't do away with the requirement to reliably detect
  node failure, and to fence misbehaving nodes. Detecting that a node
  has failed, and fencing it if unsure, is a prerequisite for any
  recovery action. So you need Corosync/Pacemaker anyway.
 
  Obviously, yes.  My post covered all of that directly ... the tagging
  bit was just additional input into the recovery operation.

 This is essentially why I am saying using the Pacemaker stack is the
 smarter approach than hacking something into Ceilometer and Heat. You
 already need Pacemaker for service availability (and all major vendors
 have adopted it for that purpose), so a highly available cloud that
 does *not* use Pacemaker at all won't be a vendor supported option for
 some time. So people will already be running Pacemaker — then why not
 use it for what it's good at?

 I may be missing something, but Pacemaker will only provide monitoring of 
 your compute node, right? I think the advantage you would get by using 
 something like Heat is having an instance agent and provide monitoring of 
 your client service, instead of just knowing the status of your hypervisor. 
 Hosts can fail, but there is another array of failures that you can't handle 
 with the global deployment monitoring.

You *are* missing something, indeed. :) Pacemaker would be a perfectly
fine tool for also monitoring the status of your guests on the hosts.
So arguably, nova-compute could in fact hook in with pcsd
(https://github.com/feist/pcs/tree/master/pcs -- all in Python) down
the road to inject VM monitoring into the Pacemaker configuration.
This would, of course, need to be specific to the hypervisor so it
would be a job for the nova driver, rather than being implemented at
the nova-compute level.

But my hunch is that that sort of thing would be for the L release;
for Kilo the low-hanging fruit would be to defend against host failure
(meaning, compute node failure, unrecoverable nova-compute service
failure, etc.).

Cheers,
Florian



Re: [openstack-dev] [Nova] Automatic evacuate

2014-10-16 Thread Florian Haas
On Thu, Oct 16, 2014 at 1:59 PM, Russell Bryant rbry...@redhat.com wrote:
 On 10/16/2014 04:29 AM, Florian Haas wrote:
 (5) Let monitoring and orchestration services deal with these use cases and
 have Nova simply provide the primitive API calls that it already does (i.e.
 host evacuate).

 That would arguably lead to an incredible amount of wheel reinvention
 for node failure detection, service failure detection, etc. etc.

 How so? (5) would use existing wheels for monitoring and orchestration
 instead of writing all new code paths inside Nova to do the same thing.

 Right, there may be some confusion here ... I thought you were both
 agreeing that the use of an external toolset was a good approach for the
 problem, but Florian's last message makes that not so clear ...

 While one of us (Jay or me) speaking for the other and saying we agree
 is a distributed consensus problem that dwarfs the complexity of
 Paxos, *I* for my part do think that an external toolset (i.e. one
 that lives outside the Nova codebase) is the better approach versus
 duplicating the functionality of said toolset in Nova.

 I just believe that the toolset that should be used here is
 Corosync/Pacemaker and not Ceilometer/Heat. And I believe the former
 approach leads to *much* fewer necessary code changes *in* Nova than
 the latter.

 Have you tried pacemaker_remote yet?  It seems like a better choice for
 this particular case, as opposed to using corosync, due to the potential
 number of compute nodes.

I'll assume that you are *not* referring to running Corosync/Pacemaker
on the compute nodes plus pacemaker_remote in the VMs, because doing
so would blow up the separation between the cloud operator and tenant
space.

Running compute nodes as baremetal extensions of a different
Corosync/Pacemaker cluster (presumably the one that manages the other
Nova services) would potentially be an option, although vendors would
need to buy into this. Ubuntu, for example, currently only ships
pacemaker-remote in universe.

*If* you're running pacemaker_remote on the compute node, though, that
then also opens up the possibility for a compute driver to just dump
the libvirt definition into a VirtualDomain Pacemaker resource,
meaning with a small callout added to Nova, you could also get the
virtual machine monitoring functionality. Bonus: this could eventually
be extended to allow live migration of guests to other compute nodes
in the same cluster, in case you want to shut down a compute node for
maintenance without interrupting your HA guests.
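
As a rough sketch of what such a callout might hand to Pacemaker: the helper below only builds a `pcs resource create` command line for the `ocf:heartbeat:VirtualDomain` resource agent; it does not run anything. The function name, the `vm-` resource ID prefix, the example path and the 30-second monitor interval are assumptions for illustration, and none of it is existing Nova code.

```python
def virtual_domain_resource_cmd(instance_id, libvirt_xml_path,
                                allow_migrate=True):
    """Build (but do not execute) a pcs command for a VirtualDomain resource.

    Hypothetical helper: the naming scheme and defaults are illustrative.
    """
    cmd = [
        "pcs", "resource", "create",
        "vm-" + instance_id,               # resource ID (assumed scheme)
        "ocf:heartbeat:VirtualDomain",     # resource agent managing the guest
        "config=" + libvirt_xml_path,      # libvirt domain definition on disk
        "hypervisor=qemu:///system",       # libvirt connection URI
        "op", "monitor", "interval=30s",   # periodic guest health check
    ]
    if allow_migrate:
        # Lets Pacemaker live-migrate the guest instead of stop/start,
        # e.g. when a node is put into standby for maintenance.
        cmd += ["meta", "allow-migrate=true"]
    return cmd


if __name__ == "__main__":
    print(" ".join(virtual_domain_resource_cmd(
        "instance-0001", "/etc/libvirt/qemu/instance-0001.xml")))
```

Whether the definition is written to disk by the compute driver, and where, is left open here; the sketch only shows that the information Pacemaker needs is small.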

Cheers,
Florian



Re: [openstack-dev] [Nova] Automatic evacuate

2014-10-16 Thread Florian Haas
On Thu, Oct 16, 2014 at 4:31 PM, Steve Gordon sgor...@redhat.com wrote:
 Running compute nodes as baremetal extensions of a different
 Corosync/Pacemaker cluster (presumably the one that manages the other
 Nova services) would potentially be an option, although vendors would
 need to buy into this. Ubuntu, for example, currently only ships
 pacemaker-remote in universe.

 This is something we'd be doing *to* OpenStack rather than *in* the 
 OpenStack projects (at least those that deliver code), in fact that's a large 
 part of the appeal. As such I don't know that there necessarily has to be one 
 true solution to rule them all, a distribution could deviate as needed, but 
 we would have some - ideally very small - number of known good 
 configurations which achieve the stated goal and are well documented.

Correct. In the infrastructure/service HA field, we already have that,
as vendors (with very few exceptions) have settled on
Corosync/Pacemaker for service availability, HAProxy for load
balancing, and MySQL/Galera for database replication, for example. It
would be great if we could see this kind of convergent evolution for
guest HA as well.

Cheers,
Florian



Re: [openstack-dev] [Nova] Automatic evacuate

2014-10-16 Thread Florian Haas
On Thu, Oct 16, 2014 at 7:03 PM, Adam Lawson alaw...@aqorn.com wrote:

 Be forewarned; here's my two cents before I've had my morning coffee.

 It would seem to me that if we were seeking some level of resiliency against 
 host failures (if a host fails, evacuate the instances that were hosted on it 
 to a host that isn't broken), it would seem that host HA is a good approach. 
 The ultimate goal of course is instance HA but the task of monitoring 
 individual instances and determining what constitutes "down" seems like a 
 much more complex task than detecting when a compute node is down. I know 
 that requiring the presence of agents should probably need some more 
 brain-cycles since we can't expect additional bytes consuming memory on each 
 individual VM.

What Russell is suggesting, though, is actually a very feasible
approach for compute node HA today and per-instance HA tomorrow.

 Additionally, I'm not really hung up on the 'how' as we all realize there are
 several ways to skin that cat, so long as that 'how' is leveraged via tools
 over which we have control and direct influence. Reason being, we may not
 want to leverage features as important as this on tools that change outside
 our control and subsequently shift the foundation of the feature we
 implemented that was based on how the product USED to work. Basically if 
 Pacemaker does what we need then cool but it seems that implementing a 
 feature should be built upon a bedrock of programs over which we have a 
 direct influence.

That almost sounds a bit like "let's always build a better wheel,
because control". I'm not sure if that's indeed the intention, but if
it is then that seems like a bad idea to me.

 This is why Nagios may be able to do it but it's a hack at best. I'm not 
 saying Nagios isn't good or the hack doesn't work but in the context of an 
 OpenStack solution, we can't require a single external tool for a feature 
 like host or VM HA. Are we suggesting that we tell people who want HA - go 
 use Nagios? Call me a purist but if we're going to implement a feature, it 
 should be our community implementing it because we have some of the best 
 minds on staff. ; )

Anyone who thinks that having a monitoring solution to page people and
then waking up a human to restart the service constitutes HA needs to
be doused in a bucket of ice water. :)

Cheers,
Florian



Re: [openstack-dev] [Nova] Automatic evacuate

2014-10-16 Thread Florian Haas
On Thu, Oct 16, 2014 at 7:48 PM, Jay Pipes jaypi...@gmail.com wrote:
 While one of us (Jay or me) speaking for the other and saying we agree
 is a distributed consensus problem that dwarfs the complexity of
 Paxos


 You've always had a way with words, Florian :)

I knew you'd like that one. :)

 , *I* for my part do think that an external toolset (i.e. one
 that lives outside the Nova codebase) is the better approach versus
 duplicating the functionality of said toolset in Nova.

 I just believe that the toolset that should be used here is
 Corosync/Pacemaker and not Ceilometer/Heat. And I believe the former
 approach leads to *much* fewer necessary code changes *in* Nova than
 the latter.


 I agree with you that Corosync/Pacemaker is the tool of choice for
 monitoring/heartbeat functionality, and is my choice for compute-node-level
 HA monitoring. For guest-level HA monitoring, I would say use
 Heat/Ceilometer. For container-level HA monitoring, it looks like fleet or
 something like Kubernetes would be a good option.

Here's why I think that's a bad idea: none of these support the
concept of being subordinate to another cluster.

Again, suppose a VM stops responding. Then
Heat/Ceilometer/Kubernetes/fleet would need to know whether the node
hosting the VM is down or not. Only if the node is up or recovered
(which Pacemaker would be responsible for) would the VM HA facility be
able to kick in. Effectively you have two views of the cluster
membership, and that sort of thing always gets messy. In the HA space
we're always facing the same issues when a replication facility
(Galera, GlusterFS, DRBD, whatever) has a different view of the
cluster membership than the cluster manager itself — which *always*
happens for a few seconds on any failover, recovery, or fencing event.

Russell's suggestion, by having remote Pacemaker instances on the
compute nodes tie in with a Pacemaker cluster on the control nodes,
does away with that discrepancy.
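
A small sketch of why a single membership view matters. The `NodeState` values and the `guest_ha_action` function are hypothetical names, not part of any of the tools mentioned; the point is only that a guest-level recovery decision has to take the cluster manager's view of the node as an input, rather than forming its own.

```python
from enum import Enum


class NodeState(Enum):
    """Node status as reported by the cluster manager (illustrative)."""
    ONLINE = "online"        # node is up, or has been recovered
    UNCERTAIN = "uncertain"  # failover, recovery or fencing in progress
    OFFLINE = "offline"      # node is down; node recovery is pending


def guest_ha_action(guest_responding: bool, node_state: NodeState) -> str:
    """Decide what a guest-level HA facility may do (hypothetical)."""
    if guest_responding:
        return "nothing"
    if node_state is NodeState.ONLINE:
        # The host is known to be fine, so the problem is in the guest.
        return "recover guest"
    # Host state is unknown or bad: acting now would mean trusting a
    # second, possibly different, view of cluster membership.
    return "defer to cluster manager"


if __name__ == "__main__":
    for state in NodeState:
        print(state.value, "->", guest_ha_action(False, state))
```

If the guest-level facility kept its own membership view instead of taking `node_state` from the cluster manager, the two could disagree during exactly the window described above.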

 I'm curious to see how the combination of compute-node-level HA and
 container-level HA tools will work together in some of the proposed
 deployment architectures (bare metal + docker containers w/ OpenStack and
 infrastructure services run in a Kubernetes pod or CoreOS fleet).

I have absolutely nothing against an OpenStack cluster using
*exclusively* Kubernetes or fleet for HA management, once those have
reached sufficient maturity. But just about every significant
OpenStack distro out there has settled on Corosync/Pacemaker for the
time being. Let's not shove another cluster manager down their throats
for little to no real benefit.

Cheers,
Florian



Re: [openstack-dev] [Nova] Automatic evacuate

2014-10-16 Thread Florian Haas
On Thu, Oct 16, 2014 at 9:40 PM, Russell Bryant rbry...@redhat.com wrote:
 On 10/16/2014 02:40 PM, Adam Lawson wrote:
 Question: is host HA not achievable using the programs we have in place
 now (with modification of course)? If not, I'm still a champion to see
 it done within our four walls.

 Yes, it is achievable (without modification, even).

 That was the primary point of:

   http://blog.russellbryant.net/2014/10/15/openstack-instance-ha-proposal/

 I think there's work to do to build up a reference configuration, test
 it out, and document it.  I believe all the required software exists and
 is already in use in many OpenStack deployments for other reasons.

+1.

Florian



Re: [openstack-dev] [Nova] Automatic evacuate

2014-10-17 Thread Florian Haas
On Fri, Oct 17, 2014 at 9:53 AM, Jastrzebski, Michal
michal.jastrzeb...@intel.com wrote:


 -Original Message-
 From: Florian Haas [mailto:flor...@hastexo.com]
 Sent: Thursday, October 16, 2014 10:53 AM
 To: OpenStack Development Mailing List (not for usage questions)
 Subject: Re: [openstack-dev] [Nova] Automatic evacuate

 On Thu, Oct 16, 2014 at 9:25 AM, Jastrzebski, Michal
 michal.jastrzeb...@intel.com wrote:
  In my opinion flavor defining is a bit hacky. Sure, it will provide us
  functionality fairly quickly, but also will strip us from flexibility
  Heat would give. Healing can be done in several ways, simple destroy
  - create (basic convergence workflow so far), evacuate with or
  without shared storage, even rebuild vm, probably few more when we put
  more thoughts to it.

 But then you'd also need to monitor the availability of *individual* guests,
 and down the rabbit hole you go.

 So suppose you're monitoring a guest with a simple ping. And it stops
 responding to that ping.

 I was more referring to monitoring the host (not the guest), and for sure not
 by ping. I was thinking of the current zookeeper service group implementation;
 we might want to use corosync and write a servicegroup plugin for that. There
 are several choices for that, and each really requires testing before we make
 any decision.

 There is also the fencing case, which we agree is important, and I think nova
 should be able to do that (since it does evacuate, it should also do fencing).
 But for working fencing we really need working host health monitoring, so I
 suggest we take baby steps here and solve one issue at a time. And that would
 be host monitoring.

You're describing all of the cases for which Pacemaker is the perfect
fit. Sorry, I see absolutely no point in teaching Nova to do that.

 (1) Has it died?
 (2) Is it just too busy to respond to the ping?
 (3) Has its guest network stack died?
 (4) Has its host vif died?
 (5) Has the L2 agent on the compute host died?
 (6) Has its host network stack died?
 (7) Has the compute host died?

 Suppose further it's using shared storage (running off an RBD volume or
 using an iSCSI volume, or whatever). Now you have almost as many recovery
 options as possible causes for the failure, and some of those recovery
 options will potentially destroy your guest's data.

 No matter how you twist and turn the problem, you need strongly consistent
 distributed VM state plus fencing. In other words, you need a full blown HA
 stack.

  I'd rather use nova for low-level tasks and maybe low-level monitoring
  (imho nova should do that using servicegroup). But I'd use something
  more configurable for actual task triggering, like heat. That
  would give us a framework rather than a mechanism. Later we might want to
  apply HA to networks or volumes; then we'll have the mechanism ready, and
  only the monitoring hook and healing will need to be implemented.
 
  We can use scheduler hints to place resources on an HA-compatible host
  (whichever health action we'd like to use). This will be a bit more
  complicated, but will also give us more flexibility.

 I apologize in advance for my bluntness, but this all sounds to me like
 you're vastly underrating the problem of reliable guest state detection and
 recovery. :)

 Guest health in my opinion is just a bit out of scope here. If we have a
 robust way of detecting host health, we can pretty much assume that if the
 host dies, the guests follow.
 There are ways to detect guest health (libvirt watchdog, ceilometer, the ping
 you mentioned), but that should be done somewhere else. And for sure not by
 evacuation.

You're making an important point here; you're asking for a robust way
of detecting host health. I can guarantee you that the way of
detecting host health that you suggest (i.e. from within Nova) will
not be robust by HA standards for at least two years, if your patch
lands tomorrow.

Cheers,
Florian



Re: [openstack-dev] [all] [tc] Technical Committee Vision draft

2017-04-14 Thread Florian Haas
On Wed, Apr 5, 2017 at 11:46 AM, Thierry Carrez  wrote:
> Hi everyone,
>
> Last year in Ann Arbor, a group of OpenStack community members
> (including 6 current TC members) attended a Servant Leadership training
> at ZingTrain organized by Colette Alexander and funded by the OpenStack
> Foundation. We found that these concepts adapted quite well to our
> unique environment. The Stewardship working group was created to try to
> further those efforts. One of the tools we
> learned about there is the concept of building a "vision" to define a
> desirable future for a group of people, and to inform future choices on
> our way there.
>
> In any virtual and global community, there are challenges around
> confusion, isolation and fragmentation. OpenStack does not escape those,
> and confusion on where we are going and what we are trying to achieve is
> common. Vision is a tool that can help with that. We decided to start
> with creating a vision for the Technical Committee. What would success
> for that group of people look like? If that exercise is successful (and
> useful), we could move on to write a vision for OpenStack itself.
>
> Members of the Technical Committee met in person around a Board+TC
> meeting in Boston last month, to start building this vision. Then over
> the last month we refined this document with the TC members that could
> not make it in person in Boston. Sean Dague polished the wording and
> posted the resulting draft at:
>
> https://review.openstack.org/#/c/453262/
>
> Now we are entering a (long) comment phase. This includes comments on
> the review, face-to-face discussions at the Forum, but also (soon) an
> open survey for confidential feedback. We'd very much like to hear your
> opinion on it.

Thanks for sharing this Thierry, and thanks to everyone putting it together.

Like many others, I have received the survey asking for feedback, and
I've also taken a look at the Gerrit change and the ensuing discussion
there. I am not sure whether Gerrit is the appropriate place to make
the following comment, so I'm taking the liberty to post this here.

This is a vision that is set out for the next couple of years. Taking
into account the size (and thus inherent inertia) of the OpenStack
community, I wonder if the goals staked out in the vision are in any
way realistic to achieve in the time allotted.

To me, it looks more like a 5-year vision than a 2-year one. In other
words, the changes staked out, to me, are more comparable to what
happened in OpenStack between 2012 and now, not between 2015 and now,
and so I have my doubts about how they would fit between now and 2019.

Now I have absolutely no objections to aiming high, and many goals in
the vision are entirely worthy of pursuit. But if you were to set out
to do something that is fundamentally infeasible in the time allotted,
then all you'd be heading for is frustration. In fact, it would run
the risk of largely discounting the vision as a whole — people are far
less likely to buy into something where, compared to what is staked
out, the realistic expectation is to achieve maybe half of it, or to
take twice the time. I think this vision runs a considerable risk of
that.

I wasn't at the leadership training and I don't know what was
discussed there. So I'm wondering if you could share whether this was
a topic of discussion, and whether perhaps people spoke up and said
that this is more of a five year plan that should now be broken down
into achievable goals with a time frame of 1-2 years each.

Thank you!

Cheers,
Florian

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [all] [tc] Technical Committee Vision draft

2017-04-14 Thread Florian Haas
On Fri, Apr 14, 2017 at 12:55 PM, Jeremy Stanley  wrote:
> It's intentionally ambitious, yes, because we want to inspire and be
> inspired to great achievements.

I generally don't think that that approach works for a large
community, except in the rare cases where you have both an utterly
awe-inspiring goal and a one-sentence definition of what "done" means
— like "before this decade is out, land a man on the moon and return
him safely to the earth" —, but I fully appreciate that people will
strongly disagree with me on that one. So let's not get into that
discussion. :)

> At the same time, the comment period
> and public review process are totally about getting some grounding
> in reality from the community, keeping us honest with ourselves as
> to what is or is not a reasonable goal (seeking exactly the sorts of
> analysis you've provided here). We want to be sure both that our
> choices of focus reflect the people we've been elected to represent,
> and that those same people can see some possibility for reaching
> these goals.
>
> So to turn this around, if we were to keep it at a 2-year vision
> do you believe we should lower our target metrics or reduce the
> number of things we're seeking to accomplish through the technical
> community (or a bit of both)?

I don't like to think in "target metrics", but looking through the
draft there are several items which look like ambitious 2-year goals
by themselves:

- Constellations. This is an extremely impactful goal where, I
believe, some organizations with entrenched business practices would
take 2 years to come aboard *even if upstream had already completely
decided right now.* OpenStack distributions as offered by vendors are,
currently, normally general-purpose, and to tailor this to specific
reference architectures is a massive undertaking for an organization
(including its support engineers, QA/QE people, presales, and sales
people). Consider that the vision draft talks about a world where
constellations already "have become the new standard way to start
exploring OpenStack." If in 2 years constellations are already meant to be
the new and accepted standard, that requires all hands on deck right now.

- Multi-language outreach. Yes, we do have OpenStack and
OpenStack-related code in other languages like Go and Erlang, but
adjacent communities perceive OpenStack as primarily a Python project.
Many non-Python OpenStack SDKs were lagging behind the Python ones for
so long that they were barely usable, and we need to win back a lot of
trust from non-Python communities even after "Go, Nodejs, or Java"
support is on par with, or comparable to, Python. Again, if you want
to convince those communities that they are now first-class citizens
in OpenStack land, that would take 2 years by itself, I'd imagine.

- Adjacent communities, and using OpenStack code in non-OpenStack
environments. The vision calls for "thinking differently about
adjacent communities". Again, this is a massive community-wide
undertaking, and much more easily said than done. Yes, some projects
(Swift comes to mind, as does Ironic) have made a conscious effort to
be valuable independent of an OpenStack environment. In others, we've
seen an effort that has since died down somewhat — the last time I
heard serious discussions about standalone Heat, for example, was in
2014.

Now, you could argue that it's a big community, so we can do all those
things in parallel. Parallelization is problematic to take for granted
in collaboration (cf. Fred Brooks, The Mythical Man-Month, 1975), but
even if we assert that it can be done (with a lot of effort), then it
only makes sense for goals that do not run counter to each other.
Constellations are all about standardization, which cuts down on
flexibility; multi-language outreach is the opposite. Incorporation of
OpenStack code into non-OpenStack projects is also counter to
standardization, or rather requires going by the rules of said
projects, not OpenStack, and thus again runs counter to the goals of
constellations.

My humble opinion here is: pick one goal. As for the others, you'll
have to see how they play out. Then, in two years, reassess and pick
the next goal. Thus, to answer your question, I'd say reduce not the
number of things you're seeking to accomplish through the technical
community, but through the explicit guidance of the technical
*committee.*

Cheers,
Florian
