Re: [openstack-dev] [Ironic] Node groups and multi-node operations

2014-01-27 Thread Robert Collins
On 27 January 2014 18:08, Joshua Harlow harlo...@yahoo-inc.com wrote:
 Thanks, guess this is entering the realm of scheduling  group scheduling and 
 how just the right level of information is needed to do efficient group 
 scheduling in nova/ironic vs the new/upcoming gantt service.

 To me splitting it into N single requests isn't group scheduling but is just 
 more of a batch processor to make things more parallel. To me it seems like 
 gantt (or heat) or something else should know enough about the topology to 
 identify where to schedule a request (or a group request) and then gantt/heat 
 should pass enough location information to nova or ironic to let it know what 
 was selected. Then nova or ironic can go about the dirty work of ensuring the 
 instances were created reliably Of course it gets complicated when 
 multiple resources are involved; but nobody said it was going to be easy ;)

Right, it does get complex. One variation for instance - get a
reservation from the scheduler which can schedule across network/block
storage/compute and then follow that up with individual deployment
requests that are tagged with the reservation. That way we don't need
a mega-API that dispatches out to everything.

Anyhow - I believe no matter what that this is future work and Ironic
should be driven by the evolving design, rather than assuming anything
about batch or not batch and changing in advance.

-Rob

-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Ironic] Node groups and multi-node operations

2014-01-26 Thread Devananda van der Veen
On Sat, Jan 25, 2014 at 7:11 AM, Clint Byrum cl...@fewbar.com wrote:

 Excerpts from Robert Collins's message of 2014-01-25 02:47:42 -0800:
  On 25 January 2014 19:42, Clint Byrum cl...@fewbar.com wrote:
   Excerpts from Robert Collins's message of 2014-01-24 18:48:41 -0800:
 
However, in looking at how Ironic works and interacts with Nova, it
doesn't seem like there is any distinction of data per-compute-node
inside Ironic.  So for this to work, I'd have to run a whole bunch of
ironic instances, one per compute node. That seems like something we
don't want to do.
  
   Huh?
  
  
   I can't find anything in Ironic that lets you group nodes by anything
   except chassis. It was not a serious discussion of how the problem would
   be solved, just a point that without some way to tie ironic nodes to
   compute-nodes I'd have to run multiple ironics.
 
  I don't understand the point. There is no tie between ironic nodes and
  compute nodes. Why do you want one?
 

 Because sans Ironic, compute-nodes still have physical characteristics
 that make grouping on them attractive for things like anti-affinity. I
 don't really want my HA instances not on the same compute node, I want
 them not in the same failure domain. It becomes a way for all
 OpenStack workloads to have more granularity than availability zone.

Yes, and with Ironic, these same characteristics are desirable but are
no longer properties of a nova-compute node; they're properties of the
hardware which Ironic manages.

In principle, the same (hypothetical) failure-domain-aware scheduling
could be done if Ironic is exposing the same sort of group awareness,
as long as the nova 'ironic driver is passing that information up to
the scheduler in a sane way. In which case, Ironic would need to be
representing such information, even if it's not acting on it, which I
think is trivial for us to do.


 So if we have all of that modeled in compute-nodes, then when adding
 physical hardware to Ironic one just needs to have something to model
 the same relationship for each physical hardware node. We don't have to
 do it by linking hardware nodes to compute-nodes, but that would be
 doable for a first cut without much change to Ironic.


You're trading failure-domain awareness for fault-tolerance in your
control plane. by binding hardware to nova-compute. Ironic is designed
explicitly to decouple the instances of Ironic (and Nova) within the
control plane from the hardware it's managing. This is one of the main
shortcomings of nova baremetal, and it doesn't seem like a worthy
trade, even for a first approximation.

   The changes to Nova would be massive and invasive as they would be
   redefining the driver apiand all the logic around it.
  
  
   I'm not sure I follow you at all. I'm suggesting that the scheduler have
   a new thing to filter on, and that compute nodes push their unique ID
   down into the Ironic driver so that while setting up nodes in Ironic one
   can assign them to a compute node. That doesn't sound massive and
   invasive.

This is already being done *within* Ironic as nodes are mapped
dynamically to ironic-conductor instances; the coordination for
failover/takeover needs to be improved, but that's incremental at this
point. Moving this mapping outside of Ironic is going to be messy and
complicated, and breaks the abstraction layer. The API change may seem
small, but it will massively overcomplicate Nova by duplicating all
the functionality of Ironic-conductor in another layer of the stack.

 
  I think we're perhaps talking about different things - in the section
  you were answering, I thought he was talking about whether the API
  should offer operations on arbitrary sets of nodes at once, or whether
  each operation should be a separate API call vs what I now think you
  were talking about which was whether operations should be able to
  describe logical relations to other instances/nodes. Perhaps if we use
  the term 'batch' rather than 'group' to talk about the
  multiple-things-at-once aspect, and grouping to talk about the
  primarily scheduler related problems of affinity / anti affinity etc,
  we can avoid future confusion.
 

 Yes, thats a good point. I was talking about modeling failure domains
 only.  Batching API requests seems like an entirely different thing.


I was conflating these terms in that I was talking about grouping
actions (batching) and groups of nodes (groups). That said, there
are really three distinct topics here. Let's break groups down
further: logical group for failure domains, and hardware group for
hardware which is physically interdependent in such a way that changes
to one node affect other node(s).


Regards,
Deva

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Ironic] Node groups and multi-node operations

2014-01-26 Thread Joshua Harlow
Doesn't nova already have logic for creating N virtual machines (similar to a 
group) in the same request? I thought it did (maybe it doesn't anymore in the 
v3 API), creating N bare metal machines seems like it would comply to that api?

Sent from my really tiny device...

 On Jan 22, 2014, at 4:50 PM, Devananda van der Veen 
 devananda@gmail.com wrote:
 
 So, a conversation came again up today around whether or not Ironic will, in 
 the future, support operations on groups of nodes. Some folks have expressed 
 a desire for Ironic to expose operations on groups of nodes; others want 
 Ironic to host the hardware-grouping data so that eg. Heat and Tuskar can 
 make more intelligent group-aware decisions or represent the groups in a UI. 
 Neither of these have an implementation in Ironic today... and we still need 
 to implement a host of other things before we start on this. FWIW, this 
 discussion is meant to stimulate thinking ahead to things we might address in 
 Juno, and aligning development along the way.
 
 There's also some refactoring / code cleanup which is going on and worth 
 mentioning because it touches the part of the code which this discussion 
 impacts. For our developers, here is additional context:
 * our TaskManager class supports locking 1 node atomically, but both the 
 driver API and our REST API only support operating on one node at a time. 
 AFAIK, no where in the code do we actually pass a group of nodes.
 * for historical reasons, our driver API requires both a TaskManager and a 
 Node object be passed to all methods. However, the TaskManager object 
 contains a reference to the Node(s) which it has acquired, so the node 
 parameter is redundant.
 * we've discussed cleaning this up, but I'd like to avoid refactoring the 
 same interfaces again when we go to add group-awareness.
 
 
 I'll try to summarize the different axis-of-concern around which the 
 discussion of node groups seem to converge...
 
 1: physical vs. logical grouping
 - Some hardware is logically, but not strictly physically, grouped. Eg, 1U 
 servers in the same rack. There is some grouping, such as failure domain, but 
 operations on discrete nodes are discreet. This grouping should be modeled 
 somewhere, and some times a user may wish to perform an operation on that 
 group. Is a higher layer (tuskar, heat, etc) sufficient? I think so.
 - Some hardware _is_ physically grouped. Eg, high-density cartridges which 
 share firmware state or a single management end-point, but are otherwise 
 discrete computing devices. This grouping must be modeled somewhere, and 
 certain operations can not be performed on one member without affecting all 
 members. Things will break if each node is treated independently.
 
 2: performance optimization
 - Some operations may be optimized if there is an awareness of concurrent 
 identical operations. Eg, deploy the same image to lots of nodes using 
 multicast or bittorrent. If Heat were to inform Ironic that this deploy is 
 part of a group, the optimization would be deterministic. If Heat does not 
 inform Ironic of this grouping, but Ironic infers it (eg, from timing of 
 requests for similar actions) then optimization is possible but 
 non-deterministic, and may be much harder to reason about or debug.
 
 3: APIs
 - Higher layers of OpenStack (eg, Heat) are expected to orchestrate discrete 
 resource units into a larger group operation. This is where the grouping 
 happens today, but already results in inefficiencies when performing 
 identical operations at scale. Ironic may be able to get around this by 
 coalescing adjacent requests for the same operation, but this would be 
 non-deterministic.
 - Moving group-awareness or group-operations into the lower layers (eg, 
 Ironic) looks like it will require non-trivial changes to Heat and Nova, and, 
 in my opinion, violates a layer-constraint that I would like to maintain. On 
 the other hand, we could avoid the challenges around coalescing. This might 
 be necessary to support physically-grouped hardware anyway, too.
 
 
 If Ironic coalesces requests, it could be done in either the ConductorManager 
 layer or in the drivers themselves. The difference would be whether our 
 internal driver API accepts one node or a set of nodes for each operation. 
 It'll also impact our locking model. Both of these are implementation details 
 that wouldn't affect other projects, but would affect our driver developers.
 
 Also, until Ironic models physically-grouped hardware relationships in some 
 internal way, we're going to have difficulty supporting that class of 
 hardware. Is that OK? What is the impact of not supporting such hardware? It 
 seems, at least today, to be pretty minimal.
 
 
 Discussion is welcome.
 
 -Devananda
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Ironic] Node groups and multi-node operations

2014-01-26 Thread Clint Byrum
Excerpts from Devananda van der Veen's message of 2014-01-26 10:27:36 -0800:
 On Sat, Jan 25, 2014 at 7:11 AM, Clint Byrum cl...@fewbar.com wrote:
 
  Excerpts from Robert Collins's message of 2014-01-25 02:47:42 -0800:
   On 25 January 2014 19:42, Clint Byrum cl...@fewbar.com wrote:
Excerpts from Robert Collins's message of 2014-01-24 18:48:41 -0800:
  
 However, in looking at how Ironic works and interacts with Nova, it
 doesn't seem like there is any distinction of data per-compute-node
 inside Ironic.  So for this to work, I'd have to run a whole bunch of
 ironic instances, one per compute node. That seems like something we
 don't want to do.
   
Huh?
   
   
I can't find anything in Ironic that lets you group nodes by anything
except chassis. It was not a serious discussion of how the problem would
be solved, just a point that without some way to tie ironic nodes to
compute-nodes I'd have to run multiple ironics.
  
   I don't understand the point. There is no tie between ironic nodes and
   compute nodes. Why do you want one?
  
 
  Because sans Ironic, compute-nodes still have physical characteristics
  that make grouping on them attractive for things like anti-affinity. I
  don't really want my HA instances not on the same compute node, I want
  them not in the same failure domain. It becomes a way for all
  OpenStack workloads to have more granularity than availability zone.
 
 Yes, and with Ironic, these same characteristics are desirable but are
 no longer properties of a nova-compute node; they're properties of the
 hardware which Ironic manages.
 

I agree, but I don't see any of that reflected in Ironic's API. I see
node CRUD, but not filtering or scheduling of any kind.

 In principle, the same (hypothetical) failure-domain-aware scheduling
 could be done if Ironic is exposing the same sort of group awareness,
 as long as the nova 'ironic driver is passing that information up to
 the scheduler in a sane way. In which case, Ironic would need to be
 representing such information, even if it's not acting on it, which I
 think is trivial for us to do.
 
 
  So if we have all of that modeled in compute-nodes, then when adding
  physical hardware to Ironic one just needs to have something to model
  the same relationship for each physical hardware node. We don't have to
  do it by linking hardware nodes to compute-nodes, but that would be
  doable for a first cut without much change to Ironic.
 
 
 You're trading failure-domain awareness for fault-tolerance in your
 control plane. by binding hardware to nova-compute. Ironic is designed
 explicitly to decouple the instances of Ironic (and Nova) within the
 control plane from the hardware it's managing. This is one of the main
 shortcomings of nova baremetal, and it doesn't seem like a worthy
 trade, even for a first approximation.
 
The changes to Nova would be massive and invasive as they would be
redefining the driver apiand all the logic around it.
   
   
I'm not sure I follow you at all. I'm suggesting that the scheduler have
a new thing to filter on, and that compute nodes push their unique ID
down into the Ironic driver so that while setting up nodes in Ironic one
can assign them to a compute node. That doesn't sound massive and
invasive.
 
 This is already being done *within* Ironic as nodes are mapped
 dynamically to ironic-conductor instances; the coordination for
 failover/takeover needs to be improved, but that's incremental at this
 point. Moving this mapping outside of Ironic is going to be messy and
 complicated, and breaks the abstraction layer. The API change may seem
 small, but it will massively overcomplicate Nova by duplicating all
 the functionality of Ironic-conductor in another layer of the stack.
 

Can you point us to the design for this? I didn't really get that
from browsing the code and docs, and I gave up trying to find a single
architecture document after very little effort.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Ironic] Node groups and multi-node operations

2014-01-26 Thread Robert Collins
On 27 January 2014 08:04, Joshua Harlow harlo...@yahoo-inc.com wrote:
 Doesn't nova already have logic for creating N virtual machines (similar to a 
 group) in the same request? I thought it did (maybe it doesn't anymore in the 
 v3 API), creating N bare metal machines seems like it would comply to that 
 api?

It does, but it splits it into N concurrent single server requests so
that they get spread out amongst different nova-compute processes -
getting you parallelisation: and the code for single server requests
is sufficiently complex that having a rarely used path that preserves
the batch seems undesirable to me.

Besides which, as Ironic also dispatches work to many different
backend workers, sending a batch to Ironic would just result in it
having to split it out as well.

-Rob

-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Ironic] Node groups and multi-node operations

2014-01-26 Thread Joshua Harlow
Thanks, guess this is entering the realm of scheduling  group scheduling and 
how just the right level of information is needed to do efficient group 
scheduling in nova/ironic vs the new/upcoming gantt service. 

To me splitting it into N single requests isn't group scheduling but is just 
more of a batch processor to make things more parallel. To me it seems like 
gantt (or heat) or something else should know enough about the topology to 
identify where to schedule a request (or a group request) and then gantt/heat 
should pass enough location information to nova or ironic to let it know what 
was selected. Then nova or ironic can go about the dirty work of ensuring the 
instances were created reliably Of course it gets complicated when multiple 
resources are involved; but nobody said it was going to be easy ;)

Sent from my really tiny device...

 On Jan 26, 2014, at 12:25 PM, Robert Collins robe...@robertcollins.net 
 wrote:
 
 On 27 January 2014 08:04, Joshua Harlow harlo...@yahoo-inc.com wrote:
 Doesn't nova already have logic for creating N virtual machines (similar to 
 a group) in the same request? I thought it did (maybe it doesn't anymore in 
 the v3 API), creating N bare metal machines seems like it would comply to 
 that api?
 
 It does, but it splits it into N concurrent single server requests so
 that they get spread out amongst different nova-compute processes -
 getting you parallelisation: and the code for single server requests
 is sufficiently complex that having a rarely used path that preserves
 the batch seems undesirable to me.
 
 Besides which, as Ironic also dispatches work to many different
 backend workers, sending a batch to Ironic would just result in it
 having to split it out as well.
 
 -Rob
 
 -- 
 Robert Collins rbtcoll...@hp.com
 Distinguished Technologist
 HP Converged Cloud
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Ironic] Node groups and multi-node operations

2014-01-25 Thread Robert Collins
On 25 January 2014 19:42, Clint Byrum cl...@fewbar.com wrote:
 Excerpts from Robert Collins's message of 2014-01-24 18:48:41 -0800:

  However, in looking at how Ironic works and interacts with Nova, it
  doesn't seem like there is any distinction of data per-compute-node
  inside Ironic.  So for this to work, I'd have to run a whole bunch of
  ironic instances, one per compute node. That seems like something we
  don't want to do.

 Huh?


 I can't find anything in Ironic that lets you group nodes by anything
 except chassis. It was not a serious discussion of how the problem would
 be solved, just a point that without some way to tie ironic nodes to
 compute-nodes I'd have to run multiple ironics.

I don't understand the point. There is no tie between ironic nodes and
compute nodes. Why do you want one?

 What makes you think this? Ironic runs in the same data centre as Nova...
 It it takes 2 Api calls to boot 1 physical machines is that really
 a performance problem? When other that first power on would you do that
 anyway?


 The API calls are meh. The image distribution and power fluctuations
 may not be.

But there isn't a strong connection between API call and image
distribution - e.g. (and this is my current favorite for 'when we get
to optimising') a glance multicast service - Ironic would just add
nodes to the relevant group as they are requested, and remove when
they complete, and glance can take care of stopping the service when
there are no members in the group.


  I actually think that the changes to Heat and Nova are trivial. Nova
  needs to have groups for compute nodes and the API needs to accept those
  groups. Heat needs to take advantage of them via the API.

 The changes to Nova would be massive and invasive as they would be
 redefining the driver apiand all the logic around it.


 I'm not sure I follow you at all. I'm suggesting that the scheduler have
 a new thing to filter on, and that compute nodes push their unique ID
 down into the Ironic driver so that while setting up nodes in Ironic one
 can assign them to a compute node. That doesn't sound massive and
 invasive.

I think we're perhaps talking about different things - in the section
you were answering, I thought he was talking about whether the API
should offer operations on arbitrary sets of nodes at once, or whether
each operation should be a separate API call vs what I now think you
were talking about which was whether operations should be able to
describe logical relations to other instances/nodes. Perhaps if we use
the term 'batch' rather than 'group' to talk about the
multiple-things-at-once aspect, and grouping to talk about the
primarily scheduler related problems of affinity / anti affinity etc,
we can avoid future confusion.

-Rob

-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Ironic] Node groups and multi-node operations

2014-01-25 Thread Clint Byrum
Excerpts from Robert Collins's message of 2014-01-25 02:47:42 -0800:
 On 25 January 2014 19:42, Clint Byrum cl...@fewbar.com wrote:
  Excerpts from Robert Collins's message of 2014-01-24 18:48:41 -0800:
 
   However, in looking at how Ironic works and interacts with Nova, it
   doesn't seem like there is any distinction of data per-compute-node
   inside Ironic.  So for this to work, I'd have to run a whole bunch of
   ironic instances, one per compute node. That seems like something we
   don't want to do.
 
  Huh?
 
 
  I can't find anything in Ironic that lets you group nodes by anything
  except chassis. It was not a serious discussion of how the problem would
  be solved, just a point that without some way to tie ironic nodes to
  compute-nodes I'd have to run multiple ironics.
 
 I don't understand the point. There is no tie between ironic nodes and
 compute nodes. Why do you want one?
 

Because sans Ironic, compute-nodes still have physical characteristics
that make grouping on them attractive for things like anti-affinity. I
don't really want my HA instances not on the same compute node, I want
them not in the same failure domain. It becomes a way for all
OpenStack workloads to have more granularity than availability zone.

So if we have all of that modeled in compute-nodes, then when adding
physical hardware to Ironic one just needs to have something to model
the same relationship for each physical hardware node. We don't have to
do it by linking hardware nodes to compute-nodes, but that would be
doable for a first cut without much change to Ironic.

  What makes you think this? Ironic runs in the same data centre as Nova...
  It it takes 2 Api calls to boot 1 physical machines is that really
  a performance problem? When other that first power on would you do that
  anyway?
 
 
  The API calls are meh. The image distribution and power fluctuations
  may not be.
 
 But there isn't a strong connection between API call and image
 distribution - e.g. (and this is my current favorite for 'when we get
 to optimising') a glance multicast service - Ironic would just add
 nodes to the relevant group as they are requested, and remove when
 they complete, and glance can take care of stopping the service when
 there are no members in the group.
 

I think we agree here. Entirely. :)

   I actually think that the changes to Heat and Nova are trivial. Nova
   needs to have groups for compute nodes and the API needs to accept those
   groups. Heat needs to take advantage of them via the API.
 
  The changes to Nova would be massive and invasive as they would be
  redefining the driver apiand all the logic around it.
 
 
  I'm not sure I follow you at all. I'm suggesting that the scheduler have
  a new thing to filter on, and that compute nodes push their unique ID
  down into the Ironic driver so that while setting up nodes in Ironic one
  can assign them to a compute node. That doesn't sound massive and
  invasive.
 
 I think we're perhaps talking about different things - in the section
 you were answering, I thought he was talking about whether the API
 should offer operations on arbitrary sets of nodes at once, or whether
 each operation should be a separate API call vs what I now think you
 were talking about which was whether operations should be able to
 describe logical relations to other instances/nodes. Perhaps if we use
 the term 'batch' rather than 'group' to talk about the
 multiple-things-at-once aspect, and grouping to talk about the
 primarily scheduler related problems of affinity / anti affinity etc,
 we can avoid future confusion.
 

Yes, thats a good point. I was talking about modeling failure domains
only.  Batching API requests seems like an entirely different thing.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Ironic] Node groups and multi-node operations

2014-01-24 Thread Clint Byrum
Excerpts from Devananda van der Veen's message of 2014-01-22 16:44:01 -0800:
 
 1: physical vs. logical grouping
 - Some hardware is logically, but not strictly physically, grouped. Eg, 1U
 servers in the same rack. There is some grouping, such as failure domain,
 but operations on discrete nodes are discreet. This grouping should be
 modeled somewhere, and some times a user may wish to perform an operation
 on that group. Is a higher layer (tuskar, heat, etc) sufficient? I think so.
 - Some hardware _is_ physically grouped. Eg, high-density cartridges which
 share firmware state or a single management end-point, but are otherwise
 discrete computing devices. This grouping must be modeled somewhere, and
 certain operations can not be performed on one member without affecting all
 members. Things will break if each node is treated independently.
 

What Tuskar wants to do is layer workloads on top of logical and physical
groupings. So it would pass to Nova Boot 4 machines with (flavor)
and distinct(failure_domain_id)

Now, this is not unique to baremetal. There are plenty of cloud workloads
where one would like anti-affinity and other such things that will span
more than a single compute node. Right now these are at a very coarse
level which is availability zone. I think it is useful for Nova to
be able to have a list of aspects for each compute node which are not
hierarchical and isolate failure domains which matter for different
work-loads.

And with that, if we simply require at least one instance of nova-compute
running for each set of aspects, Ironic does not have to model this data.

However, in looking at how Ironic works and interacts with Nova, it
doesn't seem like there is any distinction of data per-compute-node
inside Ironic.  So for this to work, I'd have to run a whole bunch of
ironic instances, one per compute node. That seems like something we
don't want to do.

So perhaps if ironic can just model _a single_ logical grouping per node,
it can defer any further distinctions up to Nova where it will benefit
all workloads, not just Ironic.

 2: performance optimization
 - Some operations may be optimized if there is an awareness of concurrent
 identical operations. Eg, deploy the same image to lots of nodes using
 multicast or bittorrent. If Heat were to inform Ironic that this deploy is
 part of a group, the optimization would be deterministic. If Heat does not
 inform Ironic of this grouping, but Ironic infers it (eg, from timing of
 requests for similar actions) then optimization is possible but
 non-deterministic, and may be much harder to reason about or debug.
 

I'm wary of trying to get too deep on optimization this early. There
are some blanket optimizations that you allude to here that I think will
likely work o-k with even the most minimal of clues.

 3: APIs
 - Higher layers of OpenStack (eg, Heat) are expected to orchestrate
 discrete resource units into a larger group operation. This is where the
 grouping happens today, but already results in inefficiencies when
 performing identical operations at scale. Ironic may be able to get around
 this by coalescing adjacent requests for the same operation, but this would
 be non-deterministic.

Agreed, I think Ironic needs _some_ level of grouping to be efficient.

 - Moving group-awareness or group-operations into the lower layers (eg,
 Ironic) looks like it will require non-trivial changes to Heat and Nova,
 and, in my opinion, violates a layer-constraint that I would like to
 maintain. On the other hand, we could avoid the challenges around
 coalescing. This might be necessary to support physically-grouped hardware
 anyway, too.


I actually think that the changes to Heat and Nova are trivial. Nova
needs to have groups for compute nodes and the API needs to accept those
groups. Heat needs to take advantage of them via the API.

There is a non-trivial follow-on which is a wholistic scheduler which
would further extend these groups into other physical resources like
networks and block devices. These all feel like logical evolutions of the
idea of making somewhat arbitrary and overlapping groups of compute nodes.

 
 If Ironic coalesces requests, it could be done in either the
 ConductorManager layer or in the drivers themselves. The difference would
 be whether our internal driver API accepts one node or a set of nodes for
 each operation. It'll also impact our locking model. Both of these are
 implementation details that wouldn't affect other projects, but would
 affect our driver developers.
 
 Also, until Ironic models physically-grouped hardware relationships in some
 internal way, we're going to have difficulty supporting that class of
 hardware. Is that OK? What is the impact of not supporting such hardware?
 It seems, at least today, to be pretty minimal.

I don't know much about hardware like that. I think it should just be
another grouping unless it affects the way Ironic talks to the hardware,
at which point it probably belongs 

Re: [openstack-dev] [Ironic] Node groups and multi-node operations

2014-01-24 Thread Robert Collins
On 25 Jan 2014 15:11, Clint Byrum cl...@fewbar.com wrote:

 Excerpts from Devananda van der Veen's message of 2014-01-22 16:44:01
-0800:

 What Tuskar wants to do is layer workloads on top of logical and physical
 groupings. So it would pass to Nova Boot 4 machines with (flavor)
 and distinct(failure_domain_id)

Maybe. Maybe it would ask for a reservation and then ask for machines
within that reservation Until it is unopened we are speculating :-)

 However, in looking at how Ironic works and interacts with Nova, it
 doesn't seem like there is any distinction of data per-compute-node
 inside Ironic.  So for this to work, I'd have to run a whole bunch of
 ironic instances, one per compute node. That seems like something we
 don't want to do.

Huh?

 So perhaps if ironic can just model _a single_ logical grouping per node,
 it can defer any further distinctions up to Nova where it will benefit
 all workloads, not just Ironic.

Agreed with this.

be deterministic. If Heat does not
  inform Ironic of this grouping, but Ironic infers it (eg, from timing of
  requests for similar actions) then optimization is possible but
  non-deterministic, and may be much harder to reason about or debug.
 

 I'm wary of trying to get too deep on optimization this early. There
 are some blanket optimizations that you allude to here that I think will
 likely work o-k with even the most minimal of clues.

+1 premature optimisation and the root of all evil...

  3: APIs
r the same operation, but this would
  be non-deterministic.

 Agreed, I think Ironic needs _some_ level of grouping to be efficient.

What makes you think this? Ironic runs in the same data centre as Nova...
It it takes 2 Api calls to boot 1 physical machines is that really
a performance problem? When other that first power on would you do that
anyway?

  - Moving group-awareness or group-operations into the lower layers (eg,
  Ironic) looks like it will require non-trivial changes to Heat and Nova,
  and, in my opinion, violates a layer-constraint that I would like to
  maintain. On the other hand, we could avoid the challenges around
  coalescing. This might be necessary to support physically-grouped
hardware
  anyway, too.
 

 I actually think that the changes to Heat and Nova are trivial. Nova
 needs to have groups for compute nodes and the API needs to accept those
 groups. Heat needs to take advantage of them via the API.

The changes to Nova would be massive and invasive as they would be
redefining the driver apiand all the logic around it.

 There is a non-trivial follow-on which is a wholistic scheduler which
 would further extend these groups into other physical resources like
 networks and block devices. These all feel like logical evolutions of the
 idea of making somewhat arbitrary and overlapping groups of compute nodes.

The holistic scheduler can also be a holistic reserver plus reservation
aware scheduler -this avoids a lot of pain imo
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Ironic] Node groups and multi-node operations

2014-01-24 Thread Robert Collins
On 23 Jan 2014 13:45, Devananda van der Veen devananda@gmail.com
wrote:

 So, a conversation came again up today around whether or not Ironic will,
in the future, support operations on groups of nodes. Some folks have
expressed a desire for Ironic to expose operations on groups of nodes;
others want Ironic to host the hardware-grouping data so that eg. Heat and
Tuskar can make more intelligent group-aware decisions or represent the
groups in a UI. Neither of these have an implementation in Ironic today...
and we still need to implement a host of other things before we start on
this. FWIW, this discussion is meant to stimulate thinking ahead to things
we might address in Juno, and aligning development along the way.

So I'm pretty thoroughly against this at this point in time. The rest of
OpenStack has a single item at a time coding style ... Booting multiple
instances is quickly transformed into n single instance boots.

I think clear  identification of services we need can take care of sharing
workloads within ironic effectively. E.g. teach glance to multicast images
and the story for doing many identical deploys at once becomes super simple.
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Ironic] Node groups and multi-node operations

2014-01-24 Thread Clint Byrum
Excerpts from Robert Collins's message of 2014-01-24 18:48:41 -0800:
 On 25 Jan 2014 15:11, Clint Byrum cl...@fewbar.com wrote:
 
  Excerpts from Devananda van der Veen's message of 2014-01-22 16:44:01
 -0800:
 
  What Tuskar wants to do is layer workloads on top of logical and physical
  groupings. So it would pass to Nova Boot 4 machines with (flavor)
  and distinct(failure_domain_id)
 
 Maybe. Maybe it would ask for a reservation and then ask for machines
 within that reservation Until it is unopened we are speculating :-)
 

Reservation is a better way to put it, I do agree with that.

  However, in looking at how Ironic works and interacts with Nova, it
  doesn't seem like there is any distinction of data per-compute-node
  inside Ironic.  So for this to work, I'd have to run a whole bunch of
  ironic instances, one per compute node. That seems like something we
  don't want to do.
 
 Huh?


I can't find anything in Ironic that lets you group nodes by anything
except chassis. It was not a serious discussion of how the problem would
be solved, just a point that without some way to tie ironic nodes to
compute-nodes I'd have to run multiple ironics.

  So perhaps if ironic can just model _a single_ logical grouping per node,
  it can defer any further distinctions up to Nova where it will benefit
  all workloads, not just Ironic.
 
 Agreed with this.
 
 be deterministic. If Heat does not
   inform Ironic of this grouping, but Ironic infers it (eg, from timing of
   requests for similar actions) then optimization is possible but
   non-deterministic, and may be much harder to reason about or debug.
  
 
  I'm wary of trying to get too deep on optimization this early. There
  are some blanket optimizations that you allude to here that I think will
  likely work o-k with even the most minimal of clues.
 
 +1 premature optimisation and the root of all evil...
 
   3: APIs
 r the same operation, but this would
   be non-deterministic.
 
  Agreed, I think Ironic needs _some_ level of grouping to be efficient.
 
 What makes you think this? Ironic runs in the same data centre as Nova...
 It it takes 2 Api calls to boot 1 physical machines is that really
 a performance problem? When other that first power on would you do that
 anyway?
 

The API calls are meh. The image distribution and power fluctuations
may not be.

   - Moving group-awareness or group-operations into the lower layers (eg,
   Ironic) looks like it will require non-trivial changes to Heat and Nova,
   and, in my opinion, violates a layer-constraint that I would like to
   maintain. On the other hand, we could avoid the challenges around
   coalescing. This might be necessary to support physically-grouped
 hardware
   anyway, too.
  
 
  I actually think that the changes to Heat and Nova are trivial. Nova
  needs to have groups for compute nodes and the API needs to accept those
  groups. Heat needs to take advantage of them via the API.
 
 The changes to Nova would be massive and invasive as they would be
 redefining the driver apiand all the logic around it.
 

I'm not sure I follow you at all. I'm suggesting that the scheduler have
a new thing to filter on, and that compute nodes push their unique ID
down into the Ironic driver so that while setting up nodes in Ironic one
can assign them to a compute node. That doesn't sound massive and
invasive.

  There is a non-trivial follow-on which is a wholistic scheduler which
  would further extend these groups into other physical resources like
  networks and block devices. These all feel like logical evolutions of the
  idea of making somewhat arbitrary and overlapping groups of compute nodes.
 
 The holistic scheduler can also be a holistic reserver plus reservation
 aware scheduler -this avoids a lot of pain imo

I think what I said still applies with that model, but it definitely
becomes a lot more robust.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [Ironic] Node groups and multi-node operations

2014-01-22 Thread Devananda van der Veen
So, a conversation came again up today around whether or not Ironic will,
in the future, support operations on groups of nodes. Some folks have
expressed a desire for Ironic to expose operations on groups of nodes;
others want Ironic to host the hardware-grouping data so that eg. Heat and
Tuskar can make more intelligent group-aware decisions or represent the
groups in a UI. Neither of these have an implementation in Ironic today...
and we still need to implement a host of other things before we start on
this. FWIW, this discussion is meant to stimulate thinking ahead to things
we might address in Juno, and aligning development along the way.

There's also some refactoring / code cleanup which is going on and worth
mentioning because it touches the part of the code which this discussion
impacts. For our developers, here is additional context:
* our TaskManager class supports locking 1 node atomically, but both the
driver API and our REST API only support operating on one node at a time.
AFAIK, no where in the code do we actually pass a group of nodes.
* for historical reasons, our driver API requires both a TaskManager and a
Node object be passed to all methods. However, the TaskManager object
contains a reference to the Node(s) which it has acquired, so the node
parameter is redundant.
* we've discussed cleaning this up, but I'd like to avoid refactoring the
same interfaces again when we go to add group-awareness.


I'll try to summarize the different axis-of-concern around which the
discussion of node groups seem to converge...

1: physical vs. logical grouping
- Some hardware is logically, but not strictly physically, grouped. Eg, 1U
servers in the same rack. There is some grouping, such as failure domain,
but operations on discrete nodes are discreet. This grouping should be
modeled somewhere, and some times a user may wish to perform an operation
on that group. Is a higher layer (tuskar, heat, etc) sufficient? I think so.
- Some hardware _is_ physically grouped. Eg, high-density cartridges which
share firmware state or a single management end-point, but are otherwise
discrete computing devices. This grouping must be modeled somewhere, and
certain operations can not be performed on one member without affecting all
members. Things will break if each node is treated independently.

2: performance optimization
- Some operations may be optimized if there is an awareness of concurrent
identical operations. Eg, deploy the same image to lots of nodes using
multicast or bittorrent. If Heat were to inform Ironic that this deploy is
part of a group, the optimization would be deterministic. If Heat does not
inform Ironic of this grouping, but Ironic infers it (eg, from timing of
requests for similar actions) then optimization is possible but
non-deterministic, and may be much harder to reason about or debug.

3: APIs
- Higher layers of OpenStack (eg, Heat) are expected to orchestrate
discrete resource units into a larger group operation. This is where the
grouping happens today, but already results in inefficiencies when
performing identical operations at scale. Ironic may be able to get around
this by coalescing adjacent requests for the same operation, but this would
be non-deterministic.
- Moving group-awareness or group-operations into the lower layers (eg,
Ironic) looks like it will require non-trivial changes to Heat and Nova,
and, in my opinion, violates a layer-constraint that I would like to
maintain. On the other hand, we could avoid the challenges around
coalescing. This might be necessary to support physically-grouped hardware
anyway, too.


If Ironic coalesces requests, it could be done in either the
ConductorManager layer or in the drivers themselves. The difference would
be whether our internal driver API accepts one node or a set of nodes for
each operation. It'll also impact our locking model. Both of these are
implementation details that wouldn't affect other projects, but would
affect our driver developers.

Also, until Ironic models physically-grouped hardware relationships in some
internal way, we're going to have difficulty supporting that class of
hardware. Is that OK? What is the impact of not supporting such hardware?
It seems, at least today, to be pretty minimal.


Discussion is welcome.

-Devananda
___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev