Re: [openstack-dev] [nova] Rocky PTG summary - cells

2018-03-15 Thread Surya Seetharaman
I would also prefer not having to rely on reading all the cell DBs to
calculate quotas.


On Thu, Mar 15, 2018 at 3:29 AM, melanie witt  wrote:

>
>
> I would prefer not to block instance creations because of "down" cells,


++


> so maybe there is some possibility to avoid it if we can get
> "queued_for_delete" and "user_id" columns added to the instance_mappings
> table.
>
>
That seems reason enough to add them, from my perspective.


Regards,
Surya.


Re: [openstack-dev] [nova] Rocky PTG summary - cells

2018-03-15 Thread Zhenyu Zheng
Thanks for the reply; both solutions look reasonable.

On Thu, Mar 15, 2018 at 10:29 AM, melanie witt  wrote:

> On Thu, 15 Mar 2018 09:54:59 +0800, Zhenyu Zheng wrote:
>
>> Thanks for the recap; I have one question about the "block creation" item:
>>
>> * An attempt to create an instance should be blocked if the project
>> has instances in a "down" cell (the instance_mappings table has a
>> "project_id" column) because we cannot count instances in "down"
>> cells for the quota check.
>>
>>
>> Since users are not aware of any cell information, and cells are mostly
>> selected at random, there is a high probability that a project's instances
>> are spread evenly across cells. The proposed behavior could easily leave
>> many users unable to create instances because a single cell is down.
>> Isn't that too harsh?
>>
>
> To be honest, I share your concern. I had planned to change quota checks
> to use placement instead of reading cell databases ASAP but hit a snag
> where we won't be able to count instances from placement because we can't
> determine the "type" of an allocation. Allocations can be instances, or
> network-related resources, or volume-related resources, etc. Adding the
> concept of an allocation "type" in placement has been a controversial
> discussion so far.
>
> BUT ... we also said we would add a column like "queued_for_delete" to the
> instance_mappings table. If we do that, we could count instances from the
> instance_mappings table in the API database and count cores/ram from
> placement and no longer rely on reading cell databases for quota checks.
> Although, there is one more wrinkle: instance_mappings has a project_id
> column but does not have a user_id column, so we wouldn't be able to get a
> count by project + user needed for the quota check against user quota. So,
> if people would not be opposed, we could also add a "user_id" column to
> instance_mappings to handle that case.
>
> I would prefer not to block instance creations because of "down" cells, so
> maybe there is some possibility to avoid it if we can get
> "queued_for_delete" and "user_id" columns added to the instance_mappings
> table.
>
> -melanie


Re: [openstack-dev] [nova] Rocky PTG summary - cells

2018-03-14 Thread melanie witt

On Thu, 15 Mar 2018 09:54:59 +0800, Zhenyu Zheng wrote:

Thanks for the recap; I have one question about the "block creation" item:

* An attempt to create an instance should be blocked if the project
has instances in a "down" cell (the instance_mappings table has a
"project_id" column) because we cannot count instances in "down"
cells for the quota check.


Since users are not aware of any cell information, and cells are mostly
selected at random, there is a high probability that a project's instances
are spread evenly across cells. The proposed behavior could easily leave
many users unable to create instances because a single cell is down.
Isn't that too harsh?


To be honest, I share your concern. I had planned to change quota checks 
to use placement instead of reading cell databases ASAP but hit a snag 
where we won't be able to count instances from placement because we 
can't determine the "type" of an allocation. Allocations can be 
instances, or network-related resources, or volume-related resources, 
etc. Adding the concept of an allocation "type" in placement has been a 
controversial discussion so far.
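
(For context, the snag is that an allocation in placement records only a
consumer and per-resource-class amounts. The dict below is an approximate,
illustrative shape, not an exact API payload: nothing in it says whether the
consumer is an instance, a network-related resource, or something else.)

    # Approximate shape of a placement allocation for one consumer; note
    # the absence of any "type" field identifying what the consumer is.
    example_allocation = {
        'allocations': {
            '<resource-provider-uuid>': {
                'resources': {
                    'VCPU': 2,
                    'MEMORY_MB': 4096,
                    'DISK_GB': 20,
                },
            },
        },
    }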


BUT ... we also said we would add a column like "queued_for_delete" to 
the instance_mappings table. If we do that, we could count instances 
from the instance_mappings table in the API database and count cores/ram 
from placement and no longer rely on reading cell databases for quota 
checks. Although, there is one more wrinkle: instance_mappings has a 
project_id column but does not have a user_id column, so we wouldn't be 
able to get a count by project + user needed for the quota check against 
user quota. So, if people would not be opposed, we could also add a 
"user_id" column to instance_mappings to handle that case.


I would prefer not to block instance creations because of "down" cells, 
so maybe there is some possibility to avoid it if we can get 
"queued_for_delete" and "user_id" columns added to the instance_mappings 
table.


-melanie






Re: [openstack-dev] [nova] Rocky PTG summary - cells

2018-03-14 Thread Zhenyu Zheng
Thanks for the recap; I have one question about the "block creation" item:

  * An attempt to create an instance should be blocked if the project has
> instances in a "down" cell (the instance_mappings table has a "project_id"
> column) because we cannot count instances in "down" cells for the quota
> check.


Since users are not aware of any cell information, and cells are mostly
selected at random, there is a high probability that a project's instances
are spread evenly across cells. The proposed behavior could easily leave
many users unable to create instances because a single cell is down.
Isn't that too harsh?

BR,

Kevin Zheng


On Thu, Mar 15, 2018 at 2:26 AM, Chris Dent  wrote:

> On Wed, 14 Mar 2018, melanie witt wrote:
>
> I’ve created a summary etherpad [0] for the nova cells session from the
>> PTG and included a plain text export of it on this email.
>>
>
> Nice summary. Apparently I wasn't there or paying attention when
> something was decided:
>
>  * An attempt to delete an instance in a "down" cell should result in a
>> 500 or 503 error.
>>
>
> Depending on how we look at it, this doesn't really align with what
> 500 or 503 are supposed to be used for. They are supposed to indicate
> that the web server is broken in some fashion: 500 being an
> unexpected and uncaught exception in the web server, 503 that the
> web server is either overloaded or down for maintenance.
>
> So, you could argue that 409 is the right thing here (as seems to
> always happen when we discuss these things). You send a DELETE to
> kill the instance, but the current state of the instance is "on a
> cell that can't be reached" which is in "conflict" with the state
> required to do a DELETE.
>
> If a 5xx is really necessary, for whatever reason, then 503 is a
> better choice than 500 because it at least signals that the broken
> thing is sort of "over there somewhere" rather than the web server
> having an error (which is what 500 is supposed to mean).
>
> --
> Chris Dent   ٩◔̯◔۶   https://anticdent.org/
> freenode: cdent tw: @anticdent


Re: [openstack-dev] [nova] Rocky PTG summary - cells

2018-03-14 Thread Chris Dent

On Wed, 14 Mar 2018, melanie witt wrote:


I’ve created a summary etherpad [0] for the nova cells session from the PTG and 
included a plain text export of it on this email.


Nice summary. Apparently I wasn't there or paying attention when
something was decided:


 * An attempt to delete an instance in a "down" cell should result in a 500 or 
503 error.


Depending on how we look at it, this doesn't really align with what
500 or 503 are supposed to be used for. They are supposed to indicate
that the web server is broken in some fashion: 500 being an
unexpected and uncaught exception in the web server, 503 that the
web server is either overloaded or down for maintenance.

So, you could argue that 409 is the right thing here (as seems to
always happen when we discuss these things). You send a DELETE to
kill the instance, but the current state of the instance is "on a
cell that can't be reached" which is in "conflict" with the state
required to do a DELETE.

If a 5xx is really necessary, for whatever reason, then 503 is a
better choice than 500 because it at least signals that the broken
thing is sort of "over there somewhere" rather than the web server
having an error (which is what 500 is supposed to mean).
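
(Purely as an illustration of the 409 option, not existing nova code; the
cell_is_reachable() helper below is made up for the example.)

    import webob.exc


    def cell_is_reachable(cell_mapping):
        # Placeholder: a real check would try the cell's DB/MQ connection.
        return getattr(cell_mapping, 'reachable', True)


    def delete_server(instance_mapping):
        if not cell_is_reachable(instance_mapping.cell_mapping):
            # The instance exists, but "lives in a cell we cannot reach"
            # conflicts with the state required to perform the DELETE.
            raise webob.exc.HTTPConflict(
                explanation="Instance is in a cell that cannot be reached")
        # ... the normal delete path would continue here ...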

--
Chris Dent   ٩◔̯◔۶   https://anticdent.org/
freenode: cdent tw: @anticdent


[openstack-dev] [nova] Rocky PTG summary - cells

2018-03-14 Thread melanie witt
Hi everyone,

I’ve created a summary etherpad [0] for the nova cells session from the PTG and 
included a plain text export of it on this email.

Thanks,
-melanie

[0] https://etherpad.openstack.org/p/nova-ptg-rocky-cells-summary

*Cells: Rocky PTG Summary

https://etherpad.openstack.org/p/nova-ptg-rocky L11

*Key topics

  * How to handle a "down" cell
  * How to handle each cell having a separate ceph cluster
  * How do we plan to progress on removing "upcalls"

*Agreements and decisions

  * In order to list instances even when we can't connect to a cell database,
we'll construct something minimal from the API database, and we'll add a column
to the instance_mappings table such as "queued_for_delete" to determine which
instances are not deleted and then list them (a rough sketch follows this list).
* tssurya will write a spec for the new column.
  * We're not going to pursue the approach of having backup URLs for cell 
databases to fall back on when a cell is "down".
  * An attempt to delete an instance in a "down" cell should result in a 500 or 
503 error.
  * An attempt to create an instance should be blocked if the project has 
instances in a "down" cell (the instance_mappings table has a "project_id" 
column) because we cannot count instances in "down" cells for the quota check.
* At this time, we won't pursue the idea of adding an allocation "type" 
concept to placement (which could be leveraged for counting cores/ram resource 
usage for quotas).
  * The topic of each cell having a separate ceph cluster and having each cell 
cache images in the imagebackend led to the topic of the "cinder imagebackend" 
again.
* Implementing a cinder imagebackend in nova would be an enormous 
undertaking that realistically isn't going to happen.
* A pragmatic solution was suggested to make boot-from-volume a first class 
citizen and make automatic boot-from-volume work well, so that we let cinder 
handle the caching of images in this scenario (and of course handle all of the 
other use cases for cinder imagebackend). This would eventually lead to the 
deprecation of the ceph imagebackend. Further discussion is required on this.
  * On removing upcalls, progress in placement will help address the remaining 
upcalls.
* dansmith will work on filtering compute hosts using the volume 
availability zone to address the cinder/cross_az_attach issue. mriedem and 
bauzas will review.
* For the xenapi host aggregate upcall, the xenapi subteam will remove it 
as a patch on top of their live-migration support patch series.
* For the server group late affinity check up-call for server create and 
evacuate, the plan is to handle it race-free with placement/scheduler. However, 
affinity modeling in placement isn't slated for work in Rocky, so the late 
affinity check upcall will have to be removed in S, at the earliest.
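
To make the first agreement above a bit more concrete, here is a rough sketch
(assumptions only: the "queued_for_delete" column and the exact fields are
hypothetical) of a minimal server entry built solely from the API database
when an instance's cell is unreachable:

    # Illustrative sketch: build a degraded server view from an
    # InstanceMapping-like object when the cell database cannot be reached.
    def minimal_server_view(mapping):
        # Skip instances the proposed queued_for_delete flag marks as deleted.
        if getattr(mapping, 'queued_for_delete', False):
            return None
        return {
            'id': mapping.instance_uuid,
            'tenant_id': mapping.project_id,
            # Fields that normally come from the cell DB are unknown here.
            'status': 'UNKNOWN',
        }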
