Re: [openstack-dev] [AODH] event-alarm timeout discussion

2016-11-09 Thread Julien Danjou
On Mon, Sep 26 2016, Zhai, Edwin wrote:

> Do you have any idea to resolve this race condition?

So following our discussion at the summit, I went ahead and wrote a spec
about that feature:

  https://review.openstack.org/#/c/395058/

Comments are welcome obviously!

Cheers,
-- 
Julien Danjou
# Free Software hacker
# https://julien.danjou.info


signature.asc
Description: PGP signature
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [AODH] event-alarm timeout discussion

2016-09-25 Thread Zhai, Edwin

On Fri, 23 Sep 2016, gordon chung wrote:




On 23/09/2016 2:18 AM, Zhai, Edwin wrote:



There are many targets(topics)/endpoints in above ceilometer code. But
in AODH, we just have one topic, 'alarm.all', and one endpoint. If it is
still multi-threaded, there is already potential race condition here,
but event-alarm tiemout make it worse.

https://github.com/openstack/aodh/blob/master/aodh/event.py#L61-L63


see my reply to other message, but yes, it is multithreaded. there's not
race currently because we don't do anything that needs to honour ordering.


Currently, we still need ordering. e.g.
2 events with different traits could trigger same alarm. If they come in an 
interval big enough, the alarm would be triggered once(Second event see the 
state as 'ALARM' and give up).  If they come and is handled concurrently, the 
alarm possibly be triggered twice(Both event see the state as 'UNKNOWN').  This 
is wrong as event alarm is one-shot(if repeat_actions=False).


Do you have any idea to resolve this race condition?





event evaluator is triggered by event only, that is, it's not called at
all until next event comes. If no event comes, evaluator just sleeps so
that can't check timeout and update_alarm. In other words, 'timeout.end'
is just for waking up evaluator.



what's the purpose of the thread being created? i thought the idea was
to receive alarm.timeout.start event -> creates a thread? can we not:
1. receive alarm.timeout.start -> create an alarm with timeout thread
2a. if event received, kill timeout thread, update alarm.
2b. if timeout reached, send alarm notification, update alarm.

^ that is just a random thought, i didn't think about exactly how to
implement. right now i'm not clear who is generating this
alarm.timeout.end event and why it needs to do that at all.



It's good idea! We need one way for timeout calculation: new thread, or alarm 
signal. If alarm signal is more stable, let's turn to it.


We need one list to keep all alarms waiting for timeout, and update the list 
when timeout signal reached.


alarm.timeout.end event is just for locking, and generated by new thread or 
alarm signal handler(your suggestion). If it is useless for locking, we can give 
up and just update alarm directly as you said.




cheers,
--
gord



Best Rgds,
Edwin

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [AODH] event-alarm timeout discussion

2016-09-23 Thread gordon chung


On 23/09/2016 2:18 AM, Zhai, Edwin wrote:

>
> There are many targets(topics)/endpoints in above ceilometer code. But
> in AODH, we just have one topic, 'alarm.all', and one endpoint. If it is
> still multi-threaded, there is already potential race condition here,
> but event-alarm tiemout make it worse.
>
> https://github.com/openstack/aodh/blob/master/aodh/event.py#L61-L63

see my reply to other message, but yes, it is multithreaded. there's not 
race currently because we don't do anything that needs to honour ordering.

>
> event evaluator is triggered by event only, that is, it's not called at
> all until next event comes. If no event comes, evaluator just sleeps so
> that can't check timeout and update_alarm. In other words, 'timeout.end'
> is just for waking up evaluator.
>

what's the purpose of the thread being created? i thought the idea was 
to receive alarm.timeout.start event -> creates a thread? can we not:
1. receive alarm.timeout.start -> create an alarm with timeout thread
2a. if event received, kill timeout thread, update alarm.
2b. if timeout reached, send alarm notification, update alarm.

^ that is just a random thought, i didn't think about exactly how to 
implement. right now i'm not clear who is generating this 
alarm.timeout.end event and why it needs to do that at all.

cheers,
-- 
gord

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [AODH] event-alarm timeout discussion

2016-09-23 Thread gordon chung


On 23/09/2016 3:19 AM, Zhai, Edwin wrote:
> "Each notification listener is associated with an executor which
> controls how incoming notification messages will be received and
> dispatched. By default, the most simple executor is used - the blocking
> executor. This executor processes inbound notifications on the server’s
> thread, blocking it from processing additional notifications until it
> finishes with the current one."

we use threading executor[1] not blocking. unless something has changed 
with oslo.messaging recently, the behaviour is: you have 64 threads 
grabbing messages off queue. because it's all threads and we don't 
really know when they context switch so you can't guarantee order 
(unless you switch to one thread).

[1] 
https://github.com/openstack/aodh/blob/9c5df8400b621141db93a259900efe3702eb6241/aodh/messaging.py#L54

cheers,
-- 
gord
__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [AODH] event-alarm timeout discussion

2016-09-23 Thread Zhai, Edwin

Just check oslo messaging doc, don't know if it's out of date.

"Each notification listener is associated with an executor which controls how 
incoming notification messages will be received and dispatched. By default, the 
most simple executor is used - the blocking executor. This executor processes 
inbound notifications on the server’s thread, blocking it from processing 
additional notifications until it finishes with the current one."


Note: If the “eventlet” executor is used, the threading and time library need to 
be monkeypatched.



On Fri, 23 Sep 2016, Zhai, Edwin wrote:


Thanks for your clarification, see my comments below.

On Thu, 22 Sep 2016, gordon chung wrote:




On 22/09/2016 2:40 AM, Zhai, Edwin wrote:


See
https://github.com/openstack/aodh/blob/master/aodh/evaluator/event.py#L158

evaluate_events is the handler of the endpoint for 'alarm.all', it
iterates the event list and evaluate them one by one with project
alarms. If both 'timeout.end' and 'X' are in the event list, I assume
they are handled in sequence at different iterations of for loop. Am I
right?


not exactly. the code above is actually an endpoint for event listener.
the event listener itself is threaded so in theory, we have 64 of these
endpoints/loops. you can override the threads to have just one but
that's where things slow down a lot. we handle this in ceilometer by
having many single thread listeners each handling it's own queue[1]. i
still need to publish diagram on how that works.

[1]
https://github.com/openstack/ceilometer/blob/master/ceilometer/notification.py#L308


There are many targets(topics)/endpoints in above ceilometer code. But in 
AODH, we just have one topic, 'alarm.all', and one endpoint. If it is still 
multi-threaded, there is already potential race condition here, but 
event-alarm tiemout make it worse.


https://github.com/openstack/aodh/blob/master/aodh/event.py#L61-L63



deleted your sequence diagram since it's malformed in my response but
that is pretty cool.

a few questions:
- when alarm creation event arrives at evaluator it creates a thread to
process alarm. this thread will timeout and raise a new event if it
doesn't receive event in time? i don't understand why we need a
timeout.end event? can the evaluator not just update_alarm and notify if
we timeout? or update_alarm and skip notify if we receive event on time?


event evaluator is triggered by event only, that is, it's not called at all 
until next event comes. If no event comes, evaluator just sleeps so that 
can't check timeout and update_alarm. In other words, 'timeout.end' is just 
for waking up evaluator.




cheers,

--
gord



Best Rgds,
Edwin



Best Rgds,
Edwin__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [AODH] event-alarm timeout discussion

2016-09-22 Thread Zhai, Edwin

Thanks for your clarification, see my comments below.

On Thu, 22 Sep 2016, gordon chung wrote:




On 22/09/2016 2:40 AM, Zhai, Edwin wrote:


See
https://github.com/openstack/aodh/blob/master/aodh/evaluator/event.py#L158

evaluate_events is the handler of the endpoint for 'alarm.all', it
iterates the event list and evaluate them one by one with project
alarms. If both 'timeout.end' and 'X' are in the event list, I assume
they are handled in sequence at different iterations of for loop. Am I
right?


not exactly. the code above is actually an endpoint for event listener.
the event listener itself is threaded so in theory, we have 64 of these
endpoints/loops. you can override the threads to have just one but
that's where things slow down a lot. we handle this in ceilometer by
having many single thread listeners each handling it's own queue[1]. i
still need to publish diagram on how that works.

[1]
https://github.com/openstack/ceilometer/blob/master/ceilometer/notification.py#L308


There are many targets(topics)/endpoints in above ceilometer code. But in AODH, 
we just have one topic, 'alarm.all', and one endpoint. If it is still 
multi-threaded, there is already potential race condition here, but event-alarm 
tiemout make it worse.


https://github.com/openstack/aodh/blob/master/aodh/event.py#L61-L63



deleted your sequence diagram since it's malformed in my response but
that is pretty cool.

a few questions:
- when alarm creation event arrives at evaluator it creates a thread to
process alarm. this thread will timeout and raise a new event if it
doesn't receive event in time? i don't understand why we need a
timeout.end event? can the evaluator not just update_alarm and notify if
we timeout? or update_alarm and skip notify if we receive event on time?


event evaluator is triggered by event only, that is, it's not called at all 
until next event comes. If no event comes, evaluator just sleeps so that can't 
check timeout and update_alarm. In other words, 'timeout.end' is just for waking 
up evaluator.




cheers,

--
gord



Best Rgds,
Edwin

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [AODH] event-alarm timeout discussion

2016-09-22 Thread gordon chung


On 22/09/2016 2:40 AM, Zhai, Edwin wrote:
>
> See
> https://github.com/openstack/aodh/blob/master/aodh/evaluator/event.py#L158
>
> evaluate_events is the handler of the endpoint for 'alarm.all', it
> iterates the event list and evaluate them one by one with project
> alarms. If both 'timeout.end' and 'X' are in the event list, I assume
> they are handled in sequence at different iterations of for loop. Am I
> right?

not exactly. the code above is actually an endpoint for event listener. 
the event listener itself is threaded so in theory, we have 64 of these 
endpoints/loops. you can override the threads to have just one but 
that's where things slow down a lot. we handle this in ceilometer by 
having many single thread listeners each handling it's own queue[1]. i 
still need to publish diagram on how that works.

[1] 
https://github.com/openstack/ceilometer/blob/master/ceilometer/notification.py#L308

>
> If we have evaluate_timeout_events as handler of another endpoint for
> 'alarm.timeout', then 2 handlers can run concurrently to lead race
> condition. I'm not familiar with underline oslo notifications, and think
> separated queue is different story. Pls. correct me if I'm wrong.
>

deleted your sequence diagram since it's malformed in my response but 
that is pretty cool.

a few questions:
- when alarm creation event arrives at evaluator it creates a thread to 
process alarm. this thread will timeout and raise a new event if it 
doesn't receive event in time? i don't understand why we need a 
timeout.end event? can the evaluator not just update_alarm and notify if 
we timeout? or update_alarm and skip notify if we receive event on time?

cheers,

-- 
gord

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [AODH] event-alarm timeout discussion

2016-09-21 Thread Zhai, Edwin

Gordon,
Thanks for your comments.
Pls. check my answer and flow chart below.

On Wed, 21 Sep 2016, gordon chung wrote:



=== event-alarm timeout implementation =
As it's for event-alarm, we need keep it as event-driven. Furthermore,
for quick response, we need use event for timeout handling. Periodic
worker can't meet real time requirement.

Separated queue for 'alarm.timeout.end'(indicates timeout expire) leads
tricky race condition.  e.g.  'XYZ.done' comes in queue1, and
'alarm.timeout.end' comes in queue2, so that they are handled in
parallel way:

1. In queue1, 'XYZ.done' is checking against alarm(current UNKNOWN), and
will be set ALARM in next step.
2. In queue2, 'alarm.timeout.end' is checking against same alarm(current
UNKNOWN), and will be set to OK(UNALARM) in next step.
3. In qeueu1, alarm transition happen: UNKNOWN => ALARM
4. In queue2, another alarm transition happen: ALARM =>OK(UNALARM)




can you clarify how this work? after user creates event timeout alarm
definition through API (i assume the alarm definition specify we should
see event x within y seconds).
- how does the evaluator get this alarm definition? is there an
alarm.timeout.start message?


Yes.


- what is this UNALARM state? to be honest, that isn't a real word so i
don't know what it's suppose to represent here.


It's OK - mean we have enough data to say: not trigger this alarm. Somebody 
mistaken it by ALARM, so I mark it as UN-ALARM.




biggest problem for me is the only thing i know is there's a
alarm.timeout.end event that needs to be handled by evaluator. i don't
know where it's coming from or what it's needed for.


I attached flow chart at bottom . pls. check it. 'timeout.end' and event 'X' 
comes in different ways, it's good if evaluator do not touch next one until 
previous one was handled.






So this alarm has bogus transition: UNKNOWN=>ALARM=>UNALARM, and tells
the user: required event came, then no required event came;

If put all events in one queue, evaluator handles them one by one(low
level oslo mesg should be multi-threaded) so that second event would see
alarm state as not UNKNOWN, and give up its transition.  As Gordc said,
it's slow. But only very small part of the event-alarm need timeout
handling, as it's only for telco usage model.


so the multithreaded part is what i was talking about. it's not handling
them one by one. it's handling 64 (or whatever the default is) at any
given time. whether its' one queue or two, you have a race to handle.


See https://github.com/openstack/aodh/blob/master/aodh/evaluator/event.py#L158

evaluate_events is the handler of the endpoint for 'alarm.all', it iterates the 
event list and evaluate them one by one with project alarms. If both 
'timeout.end' and 'X' are in the event list, I assume they are handled in 
sequence at different iterations of for loop. Am I right?


If we have evaluate_timeout_events as handler of another endpoint for 
'alarm.timeout', then 2 handlers can run concurrently to lead race condition. 
I'm not familiar with underline oslo notifications, and think separated queue is 
different story. Pls. correct me if I'm wrong.


for e in events:
try:
event = Event(e)
..

for id, alarm in six.iteritems(
self._get_project_alarms(event.project)):
try:
self._evaluate_alarm(alarm, event)
...



+--+   ++ +--+ ++  
+---+  ++
|  User|   | API server | | Notification bus | | Evaluator  |  
| Threads   |  | Alarm state|
+--+---+   +-+--+ ++-+ +-+--+  
++--+  +--+-+
   | | | |  
||
   +---> | | |  
||
   | +-+ | | |  
||
   | |Alarm create | | | |  
|+---+
   | |event: X | | | |  
|| UNKNOWN   |
   | |timeout: 5s  | | | |  
|+---+
   | +-+ | | |  
||
   | +-> |  
||
   | | +-+ | |  
||
   | | |Event sent:  | | |  
||
   |

Re: [openstack-dev] [AODH] event-alarm timeout discussion

2016-09-21 Thread gordon chung


On 21/09/16 01:43 AM, Zhai, Edwin wrote:
> All,
>
> I'd like make some clarification for the event-alarm timeout design as
> many of you have some misunderstanding here. Pls. correct me if any
> mistakes.
>
> I realized that there are 2 different things, but we mix them sometime:
> 1. event-timeout-alarm
> This is one new type of alarm that bracket *.start and *.end events and
> get alarmed when receive *.start but no *.end in timeout. This new alarm
> handles one type of events/actions, e.g. create one alarm for instance
> creation, then all instances created in future will be handled by this
> alarm. This is not for real time, so it's acceptable that user know one
> instance creation failure in 5 mins.
>
> This new type of alarm can be implemented by one worker to check the DB
> periodically to do the statistic work. That is, new evaluator works in
> 'polling' mode, something like threshold alarm evaluator.
>
> One BP is @
> https://review.openstack.org/#/c/199005/

we should probably disregard this bp since it was assumed you guys 
talked over it. i'm abandoning it as i think we just forgot about it.

>
> 2. event-alarm timeout
> This is one new feature for _existed_ event-alarm evaluator. One alarm
> becomes 'UNALARM' when not receive desire event in timeout. This feature
> just handles one specific event, e.g create one alarm for instance ABC's
> XYZ operation with 5s, then user is notified in 5s immediately if no
> XYZ.done event comes. If want check for another instance, we need create
> another alarm.
>
> This is used in telco scenario, where operator want know if operation
> failure in real time.
>
> My patch(https://review.openstack.org/#/c/272028/) is for this purpose
> only, but I feel many guys mistaken them(sometimes even me) as they
> looks similar. So my question is: Do you think this telco usage model of
> event-alarm timeout is valid? If not, we can avoid discussing its
> implementation and ignore following.
>
>
> === event-alarm timeout implementation =
> As it's for event-alarm, we need keep it as event-driven. Furthermore,
> for quick response, we need use event for timeout handling. Periodic
> worker can't meet real time requirement.
>
> Separated queue for 'alarm.timeout.end'(indicates timeout expire) leads
> tricky race condition.  e.g.  'XYZ.done' comes in queue1, and
> 'alarm.timeout.end' comes in queue2, so that they are handled in
> parallel way:
>
> 1. In queue1, 'XYZ.done' is checking against alarm(current UNKNOWN), and
> will be set ALARM in next step.
> 2. In queue2, 'alarm.timeout.end' is checking against same alarm(current
> UNKNOWN), and will be set to OK(UNALARM) in next step.
> 3. In qeueu1, alarm transition happen: UNKNOWN => ALARM
> 4. In queue2, another alarm transition happen: ALARM =>OK(UNALARM)
>


can you clarify how this work? after user creates event timeout alarm 
definition through API (i assume the alarm definition specify we should 
see event x within y seconds).
- how does the evaluator get this alarm definition? is there an 
alarm.timeout.start message?
- what is this UNALARM state? to be honest, that isn't a real word so i 
don't know what it's suppose to represent here.

biggest problem for me is the only thing i know is there's a 
alarm.timeout.end event that needs to be handled by evaluator. i don't 
know where it's coming from or what it's needed for.


> So this alarm has bogus transition: UNKNOWN=>ALARM=>UNALARM, and tells
> the user: required event came, then no required event came;
>
> If put all events in one queue, evaluator handles them one by one(low
> level oslo mesg should be multi-threaded) so that second event would see
> alarm state as not UNKNOWN, and give up its transition.  As Gordc said,
> it's slow. But only very small part of the event-alarm need timeout
> handling, as it's only for telco usage model.

so the multithreaded part is what i was talking about. it's not handling 
them one by one. it's handling 64 (or whatever the default is) at any 
given time. whether its' one queue or two, you have a race to handle.

>
> One possible improvement as JD pointed out is to avoid so many spawned
> thread. We can just create one thread inside evaluator, and ask this
> thread handle all timeout requests from evaluator. Is it acceptable for
> event-alarm timeout solution?
>
>
> Best Rgds,
> Edwin

-- 
gord

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [AODH] event-alarm timeout discussion

2016-09-20 Thread Zhai, Edwin

All,

I'd like make some clarification for the event-alarm timeout design as many of 
you have some misunderstanding here. Pls. correct me if any mistakes.


I realized that there are 2 different things, but we mix them sometime:
1. event-timeout-alarm
This is one new type of alarm that bracket *.start and *.end events and get 
alarmed when receive *.start but no *.end in timeout. This new alarm handles one 
type of events/actions, e.g. create one alarm for instance creation, then all 
instances created in future will be handled by this alarm. This is not for real 
time, so it's acceptable that user know one instance creation failure in 5 mins.


This new type of alarm can be implemented by one worker to check the DB 
periodically to do the statistic work. That is, new evaluator works in 'polling' 
mode, something like threshold alarm evaluator.


One BP is @
https://review.openstack.org/#/c/199005/

2. event-alarm timeout
This is one new feature for _existed_ event-alarm evaluator. One alarm becomes 
'UNALARM' when not receive desire event in timeout. This feature just handles 
one specific event, e.g create one alarm for instance ABC's XYZ operation with 
5s, then user is notified in 5s immediately if no XYZ.done event comes. If want 
check for another instance, we need create another alarm.


This is used in telco scenario, where operator want know if operation failure in 
real time.


My patch(https://review.openstack.org/#/c/272028/) is for this purpose only, but 
I feel many guys mistaken them(sometimes even me) as they looks similar. So my 
question is: Do you think this telco usage model of event-alarm timeout is 
valid? If not, we can avoid discussing its implementation and ignore following.



=== event-alarm timeout implementation =
As it's for event-alarm, we need keep it as event-driven. Furthermore, for quick 
response, we need use event for timeout handling. Periodic worker can't meet 
real time requirement.


Separated queue for 'alarm.timeout.end'(indicates timeout expire) leads tricky 
race condition.  e.g.  'XYZ.done' comes in queue1, and 'alarm.timeout.end' comes 
in queue2, so that they are handled in parallel way:


1. In queue1, 'XYZ.done' is checking against alarm(current UNKNOWN), and will be 
set ALARM in next step.
2. In queue2, 'alarm.timeout.end' is checking against same alarm(current 
UNKNOWN), and will be set to OK(UNALARM) in next step.

3. In qeueu1, alarm transition happen: UNKNOWN => ALARM
4. In queue2, another alarm transition happen: ALARM =>OK(UNALARM)

So this alarm has bogus transition: UNKNOWN=>ALARM=>UNALARM, and tells the user: 
required event came, then no required event came;


If put all events in one queue, evaluator handles them one by one(low level oslo 
mesg should be multi-threaded) so that second event would see alarm state as not 
UNKNOWN, and give up its transition.  As Gordc said, it's slow. But only very 
small part of the event-alarm need timeout handling, as it's only for telco 
usage model.


One possible improvement as JD pointed out is to avoid so many spawned thread. 
We can just create one thread inside evaluator, and ask this thread handle all 
timeout requests from evaluator. Is it acceptable for event-alarm timeout 
solution?



Best Rgds,
Edwin

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev