Re: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in general

2016-11-04 Thread Yujun Zhang
Hi, doctors

Is there any update on this topic?

It seems parallel execution will boost performance on a large-scale system.

But there were differing opinions on whether the current workflow is
reasonable: should the notification be sent from the Inspector directly, or
should we set the VM state to error and leave it to Nova to send the
notification?

BTW: the Doctor demo at the OpenStack Summit is fabulous [1]. Don't miss it.

[1]
https://www.openstack.org/videos/video/demo-openstack-and-opnfv-keeping-your-mobile-phone-calls-connected

--
Yujun


Re: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in general

2016-10-07 Thread Kunzmann, Gerald
Hi Tomi,

Nice! I am relieved that this simple trick worked. Thanks for the testing.

Have a nice weekend.

Best regards,
Gerald


Re: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in general

2016-10-07 Thread Juvonen, Tomi (Nokia - FI/Espoo)
Good news!

Changing the Inspector to run the reset server state calls in parallel
(threads) did the trick, and now we are back in business ☺
The Doctor test case now goes through with any decent number of VMs on the
failing host:
With 20 VMs where 10 VMs are on the failing host: 340ms
(This used to be: with 20 VMs where 10 VMs are on the failing host: 1540ms to 1780ms)
It also works with a bigger number of VMs:
With 100 VMs where 50 VMs are on the failing host: 550ms

Have a nice weekend.

Br,
Tomi
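
For reference, a minimal sketch of the threaded approach described above; it
assumes python-novaclient with compute API microversion 2.11, and the
credentials, host name, and helper names are illustrative placeholders:

# Minimal sketch of the parallel (threaded) reset server state approach
# described above; credentials and host name are placeholders.
from concurrent.futures import ThreadPoolExecutor

from keystoneauth1 import identity, session
from novaclient import client as nova_client


def make_nova_client():
    # Fill in the credentials the Inspector actually uses.
    auth = identity.Password(auth_url='http://controller:5000/v3',
                             username='admin', password='secret',
                             project_name='admin',
                             user_domain_id='default',
                             project_domain_id='default')
    return nova_client.Client('2.11', session=session.Session(auth=auth))


def force_down_and_reset_parallel(nova, failed_host, max_workers=10):
    # Mark the compute service down, then reset every affected VM to ERROR
    # concurrently so the total time no longer grows linearly with the VM count.
    nova.services.force_down(failed_host, 'nova-compute', True)
    servers = nova.servers.list(search_opts={'host': failed_host,
                                             'all_tenants': True})
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(lambda s: nova.servers.reset_state(s, 'error'), servers))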


Re: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in general

2016-10-05 Thread Juvonen, Tomi (Nokia - FI/Espoo)
Hi,

Furthermore, it seems that building the notification payload in Nova is the
pain point, so I do not think there is much to optimize there, given how
generic that code path is. This takes ~115ms per VM in the Nokia POD, so with
ten affected VMs the notifications alone already exceed the 1-second Doctor
budget.

That would leave these options:

- Can the notifications / reset server state calls be run in parallel? (A quick
and dirty fix, if it is even possible.)

- Can there be a new notification?

  - There could be a new tenant-specific notification instead of a notification
for each VM. Meaning: if a tenant has 10 VMs on a failing host, there would be
only one tenant-specific notification covering all of his VMs. The payload
should not be as heavy as it is currently, since it would feed the
tenant-specific alarm about his VMs on the failing host. With far fewer
notifications, the cumulative cost of forming several notifications also goes
away. One could also subscribe per tenant instead of, as now, having to
subscribe to alarms per VM id. The downside of the current implementation would
still remain when only a single VM failure needs to be alarmed, which makes
tenant-level alarm subscription not very convenient. (A hypothetical payload is
sketched after this message.)

    - This can be achieved by having the force-down API emit the new
notification, and the unwanted reset server state could be removed.

    - This can be achieved by having the Inspector emit the new notification,
and the unwanted reset server state could be removed (faster, and it can run in
parallel with the force-down call).

  - The needed state information is already handled by the force-down API, as
of the implementation of the "get valid server state" blueprint in Nova, so no
reset server state is needed at all. The Inspector should send the needed
notification itself. This gives the fastest execution time, since the
information does not need to flow through Nova to the Notifier, and it is easy
to run the notification in parallel with the force-down API call. This is the
right thing in the long run (strictly speaking, for "Telco grade" there would
be even more to enhance to make things as fast as possible, such as subscribing
to alarms directly from the Inspector; but we cannot achieve everything that
easily, and that is far in the future).

Br,
Tomi
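
To make the tenant-specific notification idea above concrete, a purely
hypothetical sketch of what one notification per tenant might carry; none of
these field names exist in Nova or Doctor today:

# Hypothetical per-tenant payload: one notification covers all of the
# tenant's VMs on the failed host instead of one notification per VM.
tenant_notification = {
    'event_type': 'doctor.host.failure.tenant_vms',   # invented name
    'payload': {
        'tenant_id': 'tenant-a',                       # placeholder
        'host': 'compute-1',                           # placeholder
        'vm_state': 'error',
        'instances': ['vm-a-uuid', 'vm-b-uuid'],       # all affected VMs at once
    },
}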



Re: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in general

2016-10-04 Thread Juvonen, Tomi (Nokia - FI/Espoo)
Hi,

Not a DB issue, since just sending the notifications took the same time as
changing the DB as well.

Br,
Tomi


Re: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in general

2016-10-04 Thread Ryota Mibu
Tomi,


So, it seems to be a DB bottleneck issue.

Would having a bulk API to reset servers matching a query be a solution?

Anyhow, we can talk in the meeting soon.


BR,
Ryota


Re: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in general

2016-10-04 Thread Juvonen, Tomi (Nokia - FI/Espoo)
Hi,

I further modified the test so that I do not reset the server state at all, but
instead just send the "reset server state to error" notification for each
instance when the force-down API is called:
    for instance in instances:
        notifications.send_update_with_states(context, instance,
            instance.vm_state, vm_states.ERROR, instance.task_state,
            None, service="compute", host=host, verify_states=False)

This had the same result as going through instance.save(), which also changes
the DB. So it didn't make things any better.

Br,
Tomi


Re: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in general

2016-10-04 Thread Juvonen, Tomi (Nokia - FI/Espoo)
Hi,


1.  Tried that token_cache_time=300, but it is also the default, so no difference.


2.  Then I modified the force-down API so that it internally resets the server
state for all instances on the host (so it is the only API called from the
Inspector), and there was no difference:
With 10 VMs where 5 VMs are on the failing host: 1000ms
With 10 VMs where 5 VMs are on the failing host: 1040ms
With 20 VMs where 10 VMs are on the failing host: 1540ms to 1780ms


3.  Then I added debug prints around this code in the modified force-down API,
which gets the servers and resets the state of each one.
Running the Doctor test case:
With 20 VMs where 10 VMs are on the failing host: 1540ms
In the Nova code:
Getting the instances:
    instances = self.host_api.instance_get_all_by_host(context, host)
Took: 32ms
Looping over 10 instances to reset the server state to error:
    for instance in instances:
        instance.vm_state = vm_states.ERROR
        instance.task_state = None
        instance.save(admin_state_reset=True)
Took: 1250ms
The total time the API took can also be picked up from the log:
2016-10-04 09:05:46.075 5029 INFO nova.osapi_compute.wsgi.server
[req-368d7fa5-dad6-4805-b9ed-535bf05fff06 b175813579a14b5d9eafe759a1d3e392
1dedc52c8caa42b8aea83b913035f5d9 - - -] 192.0.2.6 "PUT
/v2.1/1dedc52c8caa42b8aea83b913035f5d9/os-services/force-down HTTP/1.1" status:
200 len: 354 time: 1.4085381

So using reset server state is currently not feasible (and, as indicated
before, it shouldn't even be used).

Br,
Tomi
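
A hedged reconstruction of the experimental handler measured above, assembled
from the two snippets in this mail; the function name and the surrounding Nova
plumbing are assumptions for illustration:

# Sketch only: fetch all instances on the host and reset each to ERROR,
# timing both steps as in the debug prints described above.
import time

from nova.compute import vm_states


def reset_all_instances_on_host(host_api, context, host):
    start = time.time()
    instances = host_api.instance_get_all_by_host(context, host)
    print('getting instances took %.0fms' % ((time.time() - start) * 1000))  # ~32ms

    start = time.time()
    for instance in instances:
        instance.vm_state = vm_states.ERROR
        instance.task_state = None
        # One DB update plus one notification per VM; this is the slow part.
        instance.save(admin_state_reset=True)
    print('reset loop took %.0fms' % ((time.time() - start) * 1000))  # ~1250ms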



Re: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in general

2016-09-30 Thread Ryota Mibu
Hi,


That's an interesting evaluation!

Yes, this should be an important issue.

I suspect that token validation in Keystone takes a major part of the
processing time. If so, we should consider using a Keystone trust, which can
skip the validation, or using token caches. Tomi, can you try the same
evaluation with token caches enabled in the client (via --os-cache) and in Nova
([keystone_authtoken] token_cache_time=300)?

Or, maybe we can check how many HTTP messages and DB queries happen per VM
reset?


Thanks,
Ryota
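
For reference, the server-side setting Ryota refers to would go into nova.conf
(300 seconds is also the default, as Tomi later confirms), with --os-cache
passed on the client side; a sketch only:

# nova.conf on the controller node
[keystone_authtoken]
token_cache_time = 300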



Re: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in general

2016-09-30 Thread Yujun Zhang
It is almost linear in the number of VMs, since the requests are sent one by
one. And I think we should raise the priority of this issue.

But I wonder how it will perform if the requests are sent simultaneously with
async calls. How will Nova deal with that?



Re: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in general

2016-09-30 Thread Juvonen, Tomi (Nokia - FI/Espoo)
Hi,

I ran the Doctor test case in the Nokia POD with the APEX installer and
state-of-the-art Airframe HW. I modified the Doctor test case so that I can run
several VMs and the consumer can receive alarms for them. I am measuring
whether it is possible to stay within the Doctor requirement of under 1 second
from recognizing the fault to the alarm reaching the consumer. This way I can
see how much overhead is added when more VMs are on the failing host (the
overhead comes from calling the reset server state API for each VM on the
failing host).

Here is how many milliseconds it took to get through the scenario:
With 1 VM on the failing host: 180ms
With 10 VMs where 5 VMs are on the failing host: 800ms to 1040ms
With 20 VMs where 12 VMs are on the failing host: 2410ms
With 20 VMs where 13 VMs are on the failing host: 2010ms
With 20 VMs where 11 VMs are on the failing host: 2380ms
With 50 VMs where 27 VMs are on the failing host: 5060ms
With 100 VMs where 49 VMs are on the failing host: 8180ms

Conclusion: in an ideal environment one can run 5 VMs on a host and still
fulfill the Doctor requirement. So this needs to be enhanced.

Br,
Tomi


Re: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in general

2016-09-29 Thread Juvonen, Tomi (Nokia - FI/Espoo)
Hi,

Replies inline; hope it explains some.

In short, the alternatives are:
1./a. Reset the server state for all VMs on a host with a single API call,
either via 1./a/1. a host-specific reset server state API or 1./a/2. the host
force-down API.
1./b. No reset server state is done. The host force-down API would trigger a
new kind of notification for each tenant about the affected VMs.
2. No notification through the Controller. The Inspector would form the
notifications to the Notifier, which makes it easier to tailor the
notifications and alarms the way we want. Only the host force-down API is
called by the Inspector, no reset server state.

Br,
Tomi

From: Yujun Zhang [mailto:zhangyujun+...@gmail.com]
Sent: Thursday, September 29, 2016 9:37 AM
To: Juvonen, Tomi (Nokia - FI/Espoo) ; 
opnfv-tech-discuss@lists.opnfv.org
Subject: Re: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in 
general

> Hi, Tomi
>
> Thanks for the summary.
>
> I am a bit confused about the difference between 2. and 1./b. Would you
> please give an example to explain how it would work?

In 2. the Inspector sends the notification, not the Controller. That means the
notification can be tailored exactly to meet the needs, which is not the case
with 1./b. Option 2. assumes we can tailor the notification and alarm(s) the
way we want them to be.

> Suppose we have
>
> - tenant-a
>   - vm-a on host-a
> - tenant-b
>   - vm-b on host-a
>
> When a raw failure occurs on host-a, the existing sequence[1] would be
>
> 1. Monitor sends "host-a failure" event to Inspector
> 2. Inspector finds affected VMs (gets all VMs on host-a)

The Inspector should already know the VMs on the host; it shouldn't have to
fetch them here anymore if properly implemented. Anyhow you are right, this is
what we have currently.

> 3. Inspector resets affected VMs (vm-a and vm-b) to error state

Currently the Inspector resets the servers to error state in order to get the
notification from the Controller (Nova). Options "2." and "1./b." say the
Inspector should not reset servers to error state, only force down the host.

> 4. Controller requests Notifier to notify all

In 2. the Controller is not the one sending the notification that triggers the
alarm.

> ...
>
> I think this is how "1./a." works.

Yes.

> For "1./b." it seems to be close to the alternative sequence in the fault
> management scenario[2]. Instead of waiting for the Controller to send the
> notification, the Inspector will directly inform the Notifier about it.

In 1./b. the notification is sent from the Controller for all the VMs when the
force-down host API is called on Nova. In 2. the notification is sent directly
from the Inspector to the Notifier.

> Apparently, 5a is mandatory before 5b and 5c. But 5b and 5c. (alt) can
> be triggered simultaneously with async calls.
>
> If we deploy vitrage as the inspector, VMs state error could be deduced and
> notified independently from "5b. Update State" action. Then the time required
> for updating all VMs state would not matter any more.

The "get valid server state" work made the VM expose a host_status field, so
one gets a proper state when the host is down. This was done exactly because,
when a host is down, there was no indication when querying servers (through the
Nova servers API) that they have a problem. It was done this way because reset
server state was not supposed to be called, and no existing VM state field was
to be changed to indicate that the host is down. Anyhow, if we in Doctor still
insist that reset server state should be called, it is great that it can be
done independently as you say.

> "1./b." looks good to me but I'd like to hear more on "2."

1. Monitor sends "host-a failure" event to the Inspector.
2. The Inspector knows the topology already, so it internally figures out the
VMs on the host per tenant.
3. The Inspector forces down the host (and does the fencing of the host, which
we currently have not implemented at all).
4. The Inspector sends the needed notifications to the Notifier to form exactly
the alarms needed by the tenants (and probably also a better alarm for the
physical fault than could be made from the notification about the nova-compute
service state change through the service.update notification).

> [1] http://artifacts.opnfv.org/doctor/docs/index.html#figure-p1

Yes, the figure shows it going through the Controller, but it has already been
discussed earlier whether this is a good idea.

> [2] http://artifacts.opnfv.org/doctor/docs/index.html#figure8
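
For illustration, a minimal sketch of the "2." flow above, assuming
python-novaclient (compute API microversion 2.11) for the force-down call and
oslo.messaging for the notification; the event type and payload shape are
invented, not an existing contract:

# Sketch: the Inspector forces the compute service down and, in parallel,
# sends one tenant-scoped notification towards the Notifier, without
# resetting any server state through Nova.
import threading

import oslo_messaging
from oslo_config import cfg


def notify_tenants(affected_vms_by_tenant):
    # The Inspector already knows which VMs of which tenant sit on the failed
    # host from its own topology cache, so no Nova round-trip is needed here.
    transport = oslo_messaging.get_notification_transport(cfg.CONF)
    notifier = oslo_messaging.Notifier(transport, driver='messaging',
                                       publisher_id='doctor.inspector',
                                       topics=['notifications'])
    for tenant_id, vm_uuids in affected_vms_by_tenant.items():
        # Event type and payload are illustrative only.
        notifier.info({}, 'doctor.host.failure',
                      {'tenant_id': tenant_id,
                       'instances': vm_uuids,
                       'vm_state': 'error'})


def handle_host_failure(nova, host, affected_vms_by_tenant):
    # Run the per-tenant notifications in parallel with the force-down call.
    worker = threading.Thread(target=notify_tenants,
                              args=(affected_vms_by_tenant,))
    worker.start()
    nova.services.force_down(host, 'nova-compute', True)
    worker.join()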


Re: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in general

2016-09-28 Thread Yujun Zhang
Hi, Tomi

Thanks for the summary.

I am a bit confused about the difference between 2. and 1./b. Would you
please give an example to explain how it would work?

Suppose we have

- tenant-a
  - vm-a on host-a
- tenant-b
  - vm-b on host-a

When a raw failure occurs on host-a, the existing sequence[1] would be

1. Monitor sends "host-a failure" event to Inspector
2. Inspector finds affected VMs (gets all VMs on host-a)
3. Inspector resets affected VMs (vm-a and vm-b) to error state
4. Controller requests Notifier to notify all
...

I think this is how "*1./a.*" works.

For "*1./b.*" it seems to be close to the alternative sequence in fault
management scenario[2]. Instead of waiting for Controller to send
notification, the Inspector will directly inform the Notifier about it.

Apparently, 5a is mandatory before 5b and 5c. But 5b and 5c. *(alt)* can
be triggered simultaneously with async calls.

If we deploy Vitrage as the Inspector, the VMs' error state could be *deduced*
and notified *independently* from the "5b. Update State" action. Then the time
required for updating all VMs' state would not matter any more.

"*1./b.*" looks good to me but I'd like to hear more on "2."

[1] http://artifacts.opnfv.org/doctor/docs/index.html#figure-p1
[2] http://artifacts.opnfv.org/doctor/docs/index.html#figure8



Re: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in general

2016-09-27 Thread Juvonen, Tomi (Nokia - FI/Espoo)
Hi,

As discussed yesterday in the Doctor meeting, there are several ways to
approach the problem and many different aspects. If we try to make a blueprint
for OpenStack Nova, there is a window open now for a couple of weeks to get it
into the next Ocata release (or Danube in OPNFV). Not sure if there is time for
that, but here is a summary:


1.  The way we use "reset server state" is not the way it is used in
OpenStack. Forcing down a host does not require resetting server state.
Do we want to state that we still want to use it anyway, because we want the
notification in order to have an alarm?

a.  Yes:

1.  Do we want to enhance the functionality to reset the state of all
servers on a host?

2.  Do we want the force down API to be able to optionally reset server state
for all VMs on the host? (A quick sketch follows after this summary.)
Note! "Get valid server state" was done because the reason that there is no 
server specific state changing when there is a host specific fault (as reset 
server state is not called). This is why a host_status field was added for user 
querying his server to know there is nothing wrong with his VM, but it is 
currently down as host is in that state.


b.  No:
We could try to make a change so that calling force down on a host would send a
notification about the affected VMs (as many notifications as there are tenants
with VMs on that host).


2.  Only the Inspector knows everything that is needed for the different alarms,
and it is just overhead to push that information through, for example, Nova to
get a notification that can be translated into an alarm. We also do not get the
right content into the alarms that way. This leads to the fact that the only way
to get things right is to send the notification from the Inspector to the
Notifier, so that we have the right kind of alarms: tenant-specific alarms with
their VMs, and a separate physical fault alarm (with respect to ETSI GS
NFV-IFA 005).

IMHO the only right choice is "2." The next one would be "1. / b." The least
feasible would be "1. / a."
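
For reference, a minimal sketch of what "1. / a." could look like from a
client, assuming `nova` is an already authenticated python-novaclient Client
with a new enough API microversion (force_down and host_status were added
around 2.11 and 2.16, if I remember right). This is only an illustration, not
an agreed implementation:

    # Hedged sketch for "1. / a.": force the compute host down and optionally
    # reset the state of every VM on it. Assumes `nova` is an authenticated
    # python-novaclient Client with a recent enough microversion.
    def force_down_host(nova, host, reset_vm_state=False):
        # mark the nova-compute service on the host as forced down
        nova.services.force_down(host, 'nova-compute', True)

        # admin listing of all VMs on the failed host, across tenants
        servers = nova.servers.list(search_opts={'host': host,
                                                 'all_tenants': 1})

        for server in servers:
            if reset_vm_state:
                # "1. / a.": reset each VM to ERROR so a notification goes out
                nova.servers.reset_state(server, 'error')
            else:
                # with "get valid server state" the host_status field already
                # tells the user the VM is fine but its host is down
                print(server.id, getattr(server, 'host_status', 'UNKNOWN'))

The per-VM reset_state loop is the part that grows with the number of VMs on
the host, which is why "1. / b." and "2." try to avoid it.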

Br,
Tomi

From: opnfv-tech-discuss-boun...@lists.opnfv.org 
[mailto:opnfv-tech-discuss-boun...@lists.opnfv.org] On Behalf Of Juvonen, Tomi 
(Nokia - FI/Espoo)
Sent: Wednesday, September 21, 2016 9:52 AM
To: opnfv-tech-discuss@lists.opnfv.org
Subject: Suspected SPAM - [opnfv-tech-discuss] [Doctor] Reset Server State and 
alarms in general

Hi,

I had a lively discussion yesterday with the OpenStack Nova cores about reset
server state. At first it was about how to do that with one API call for all
VMs on a host (hypervisor), as discussed in DOCTOR-78. But then it came down to
the question of why we actually want to reset server state in the first place.
It is not something that needs to be done when forcing down a host. If we want
a notification about the affected VMs, and further an alarm, then that is
another thing; if we want that kind of notification, it is something we should
write a spec for, not reset the state to error for each VM on a host. That is
something we should not be doing in the first place if the error was not on the
VM but at the host level (yes, before you ask, Nova can keep a working VM's
state unchanged while the host is down. You do not touch the VM state if you do
not want to do something to the VM, or unless it was actually the one having
the error. And in some scenarios you do not want to do anything to the VM
itself at all, but just be happy that it comes up again on the same host when
the host comes back.)

Again I realize here what I have already said long ago, before we had anything:
it will not be possible to make alarms correctly by changing state in Nova and
the other controllers and then triggering the alarm from the notification about
those state changes. That will never give us what we want for the alarms, even
though we certainly still need to correct the states otherwise. Even where we
do get a notification triggered by a state change, it will not contain the
information needed in the alarm, and surely we should not call APIs in vain
just to get an alarm (like reset server state).

We want tenant/VNFM-specific alarms that tell which of the tenant's VMs
(virtual resources) are affected by a fault, and the cause (and of course
alarms about physical faults that will not be consumed by the tenant/VNFM,
plus the other fields needed by the ETSI spec). The only way of getting this
correct for every kind of fault that can appear is to form all the alarms (the
notifications that become alarms) in the Inspector (Congress or Vitrage). It is
the only place that has all the information needed in the different scenarios,
can get this right, and has the minimum delay that is crucial in Telco fault
management. Also, if we want OPNFV used in production, and one would need to be
OPNFV compliant, we need to get this right. I strongly suggest that, while the
way we raise alarms today is a great step and a fine proof of concept (changing
states and having the alarm in under 1 second), we take the next steps towards
a conceptually correct way to achieve this and towards correct alarms.
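
To make "form all the alarms in the Inspector" concrete, a rough sketch of what
it could look like if the Inspector used oslo.messaging to push one
notification per affected tenant straight to the Notifier; the event type and
payload fields here are only placeholders, not an agreed Doctor format:

    # Hedged sketch: the Inspector groups affected VMs per tenant and emits one
    # notification per tenant via oslo.messaging. 'compute.host.fault' and the
    # payload fields are illustrative, not a spec.
    from collections import defaultdict

    import oslo_messaging
    from oslo_config import cfg

    def notify_tenants(host, fault_cause, instances):
        transport = oslo_messaging.get_notification_transport(cfg.CONF)
        notifier = oslo_messaging.Notifier(transport,
                                           publisher_id='doctor.inspector',
                                           driver='messagingv2',
                                           topics=['notifications'])

        by_tenant = defaultdict(list)
        for inst in instances:      # inst: dict with 'uuid' and 'tenant_id'
            by_tenant[inst['tenant_id']].append(inst['uuid'])

        for tenant_id, vm_uuids in by_tenant.items():
            payload = {
                'hostname': host,
                'cause': fault_cause,      # e.g. 'link down', 'host down'
                'tenant_id': tenant_id,
                'affected_virtual_resources': vm_uuids,
            }
            # one alarm-forming notification per tenant, from the Inspector
            notifier.info({}, 'compute.host.fault', payload)

This would also give the "1. / b." style grouping for free: one notification
per tenant that has VMs on the failed host.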

Br,
Tomi



__

Re: [opnfv-tech-discuss] [Doctor] Reset Server State and alarms in general

2016-09-21 Thread Yujun Zhang
After reading the whole message, I could not agree more with the conclusion.
IIUC, we should probably raise a deduced alarm in the inspector instead of
requesting the controller to reset server state.

On Wed, Sep 21, 2016 at 2:51 PM Juvonen, Tomi (Nokia - FI/Espoo) <
tomi.juvo...@nokia.com> wrote:

> Hi,
>
> I had a lively discussion yesterday with the OpenStack Nova cores about
> reset server state. At first it was about how to do that with one API call
> for all VMs on a host (hypervisor), as discussed in DOCTOR-78. But then it
> came down to the question of why we actually want to reset server state in
> the first place. It is not something that needs to be done when forcing down
> a host. If we want a notification about the affected VMs, and further an
> alarm, then that is another thing; if we want that kind of notification, it
> is something we should write a spec for.
>

This sounds like a job for an inspector like Vitrage, i.e. deduce a VM
error from a host error and raise a deduced alarm.
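
A tiny illustration of that deduction in plain Python (not Vitrage's template
language): a host error implies a deduced error alarm for every VM hosted on
it.

    # Hedged sketch of the deduction itself: a host-level error implies a
    # deduced error alarm for each VM on that host.
    def deduce_vm_alarms(host_alarm, vms_by_host):
        deduced = []
        for vm in vms_by_host.get(host_alarm['hostname'], []):
            deduced.append({
                'type': 'vm.error.deduced',
                'vm_uuid': vm,
                'caused_by': host_alarm['hostname'],
                'cause': host_alarm.get('cause', 'host down'),
            })
        return deduced

    print(deduce_vm_alarms({'hostname': 'compute-1', 'cause': 'host down'},
                           {'compute-1': ['vm-1', 'vm-2']}))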

> Not reset the state to error for each VM on a host, which is something we
> should not be doing in the first place if the error was not on the VM but at
> the host level (yes, before you ask, Nova can keep a working VM's state
> unchanged while the host is down. You do not touch the VM state if you do
> not want to do something to the VM, or unless it was actually the one having
> the error. And in some scenarios you do not want to do anything to the VM
> itself at all, but just be happy that it comes up again on the same host
> when the host comes back.)
>

Agree


> Again I realize here what I have already said long ago, before we had
> anything: it will not be possible to make alarms correctly by changing
> state in Nova and the other controllers and then triggering the alarm from
> the notification about those state changes. That will never give us what we
> want for the alarms, even though we certainly still need to correct the
> states otherwise. Even where we do get a notification triggered by a state
> change, it will not contain the information needed in the alarm, and surely
> we should not call APIs in vain just to get an alarm (like reset server
> state).
>
> We want tenant/VNFM-specific alarms that tell which of the tenant's VMs
> (virtual resources) are affected by a fault, and the cause (and of course
> alarms about physical faults that will not be consumed by the tenant/VNFM,
> plus the other fields needed by the ETSI spec). The only way of getting this
> correct for every kind of fault that can appear is to form all the alarms
> (the notifications that become alarms) in the Inspector (Congress or
> Vitrage).
>

I have exactly the same understanding.

> It is the only place that has all the information needed in the different
> scenarios, can get this right, and has the minimum delay that is crucial in
> Telco fault management. Also, if we want OPNFV used in production, and one
> would need to be OPNFV compliant, we need to get this right. I strongly
> suggest that, while the way we raise alarms today is a great step and a fine
> proof of concept (changing states and having the alarm in under 1 second),
> we take the next steps towards a conceptually correct way to achieve this
> and towards correct alarms.
>
> Br,
> Tomi
>
>
>
> ___
> opnfv-tech-discuss mailing list
> opnfv-tech-discuss@lists.opnfv.org
> https://lists.opnfv.org/mailman/listinfo/opnfv-tech-discuss
>
___
opnfv-tech-discuss mailing list
opnfv-tech-discuss@lists.opnfv.org
https://lists.opnfv.org/mailman/listinfo/opnfv-tech-discuss


[opnfv-tech-discuss] [Doctor] Reset Server State and alarms in general

2016-09-20 Thread Juvonen, Tomi (Nokia - FI/Espoo)
Hi,

I had a lively discussion yesterday with the OpenStack Nova cores about reset
server state. At first it was about how to do that with one API call for all
VMs on a host (hypervisor), as discussed in DOCTOR-78. But then it came down to
the question of why we actually want to reset server state in the first place.
It is not something that needs to be done when forcing down a host. If we want
a notification about the affected VMs, and further an alarm, then that is
another thing; if we want that kind of notification, it is something we should
write a spec for, not reset the state to error for each VM on a host. That is
something we should not be doing in the first place if the error was not on the
VM but at the host level (yes, before you ask, Nova can keep a working VM's
state unchanged while the host is down. You do not touch the VM state if you do
not want to do something to the VM, or unless it was actually the one having
the error. And in some scenarios you do not want to do anything to the VM
itself at all, but just be happy that it comes up again on the same host when
the host comes back.)

Again I realize here what I have already said long ago, before we had anything:
it will not be possible to make alarms correctly by changing state in Nova and
the other controllers and then triggering the alarm from the notification about
those state changes. That will never give us what we want for the alarms, even
though we certainly still need to correct the states otherwise. Even where we
do get a notification triggered by a state change, it will not contain the
information needed in the alarm, and surely we should not call APIs in vain
just to get an alarm (like reset server state).

We want tenant/VNFM-specific alarms that tell which of the tenant's VMs
(virtual resources) are affected by a fault, and the cause (and of course
alarms about physical faults that will not be consumed by the tenant/VNFM,
plus the other fields needed by the ETSI spec). The only way of getting this
correct for every kind of fault that can appear is to form all the alarms (the
notifications that become alarms) in the Inspector (Congress or Vitrage). It is
the only place that has all the information needed in the different scenarios,
can get this right, and has the minimum delay that is crucial in Telco fault
management. Also, if we want OPNFV used in production, and one would need to be
OPNFV compliant, we need to get this right. I strongly suggest that, while the
way we raise alarms today is a great step and a fine proof of concept (changing
states and having the alarm in under 1 second), we take the next steps towards
a conceptually correct way to achieve this and towards correct alarms.

Br,
Tomi



___
opnfv-tech-discuss mailing list
opnfv-tech-discuss@lists.opnfv.org
https://lists.opnfv.org/mailman/listinfo/opnfv-tech-discuss