Re: [openstack-dev] [AODH] event-alarm timeout discussion
On Mon, Sep 26 2016, Zhai, Edwin wrote: > Do you have any idea to resolve this race condition? So following our discussion at the summit, I went ahead and wrote a spec about that feature: https://review.openstack.org/#/c/395058/ Comments are welcome obviously! Cheers, -- Julien Danjou # Free Software hacker # https://julien.danjou.info signature.asc Description: PGP signature __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [AODH] event-alarm timeout discussion
On Fri, 23 Sep 2016, gordon chung wrote: On 23/09/2016 2:18 AM, Zhai, Edwin wrote: There are many targets(topics)/endpoints in above ceilometer code. But in AODH, we just have one topic, 'alarm.all', and one endpoint. If it is still multi-threaded, there is already potential race condition here, but event-alarm tiemout make it worse. https://github.com/openstack/aodh/blob/master/aodh/event.py#L61-L63 see my reply to other message, but yes, it is multithreaded. there's not race currently because we don't do anything that needs to honour ordering. Currently, we still need ordering. e.g. 2 events with different traits could trigger same alarm. If they come in an interval big enough, the alarm would be triggered once(Second event see the state as 'ALARM' and give up). If they come and is handled concurrently, the alarm possibly be triggered twice(Both event see the state as 'UNKNOWN'). This is wrong as event alarm is one-shot(if repeat_actions=False). Do you have any idea to resolve this race condition? event evaluator is triggered by event only, that is, it's not called at all until next event comes. If no event comes, evaluator just sleeps so that can't check timeout and update_alarm. In other words, 'timeout.end' is just for waking up evaluator. what's the purpose of the thread being created? i thought the idea was to receive alarm.timeout.start event -> creates a thread? can we not: 1. receive alarm.timeout.start -> create an alarm with timeout thread 2a. if event received, kill timeout thread, update alarm. 2b. if timeout reached, send alarm notification, update alarm. ^ that is just a random thought, i didn't think about exactly how to implement. right now i'm not clear who is generating this alarm.timeout.end event and why it needs to do that at all. It's good idea! We need one way for timeout calculation: new thread, or alarm signal. If alarm signal is more stable, let's turn to it. We need one list to keep all alarms waiting for timeout, and update the list when timeout signal reached. alarm.timeout.end event is just for locking, and generated by new thread or alarm signal handler(your suggestion). If it is useless for locking, we can give up and just update alarm directly as you said. cheers, -- gord Best Rgds, Edwin __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [AODH] event-alarm timeout discussion
On 23/09/2016 2:18 AM, Zhai, Edwin wrote: > > There are many targets(topics)/endpoints in above ceilometer code. But > in AODH, we just have one topic, 'alarm.all', and one endpoint. If it is > still multi-threaded, there is already potential race condition here, > but event-alarm tiemout make it worse. > > https://github.com/openstack/aodh/blob/master/aodh/event.py#L61-L63 see my reply to other message, but yes, it is multithreaded. there's not race currently because we don't do anything that needs to honour ordering. > > event evaluator is triggered by event only, that is, it's not called at > all until next event comes. If no event comes, evaluator just sleeps so > that can't check timeout and update_alarm. In other words, 'timeout.end' > is just for waking up evaluator. > what's the purpose of the thread being created? i thought the idea was to receive alarm.timeout.start event -> creates a thread? can we not: 1. receive alarm.timeout.start -> create an alarm with timeout thread 2a. if event received, kill timeout thread, update alarm. 2b. if timeout reached, send alarm notification, update alarm. ^ that is just a random thought, i didn't think about exactly how to implement. right now i'm not clear who is generating this alarm.timeout.end event and why it needs to do that at all. cheers, -- gord __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [AODH] event-alarm timeout discussion
On 23/09/2016 3:19 AM, Zhai, Edwin wrote: > "Each notification listener is associated with an executor which > controls how incoming notification messages will be received and > dispatched. By default, the most simple executor is used - the blocking > executor. This executor processes inbound notifications on the server’s > thread, blocking it from processing additional notifications until it > finishes with the current one." we use threading executor[1] not blocking. unless something has changed with oslo.messaging recently, the behaviour is: you have 64 threads grabbing messages off queue. because it's all threads and we don't really know when they context switch so you can't guarantee order (unless you switch to one thread). [1] https://github.com/openstack/aodh/blob/9c5df8400b621141db93a259900efe3702eb6241/aodh/messaging.py#L54 cheers, -- gord __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [AODH] event-alarm timeout discussion
Just check oslo messaging doc, don't know if it's out of date. "Each notification listener is associated with an executor which controls how incoming notification messages will be received and dispatched. By default, the most simple executor is used - the blocking executor. This executor processes inbound notifications on the server’s thread, blocking it from processing additional notifications until it finishes with the current one." Note: If the “eventlet” executor is used, the threading and time library need to be monkeypatched. On Fri, 23 Sep 2016, Zhai, Edwin wrote: Thanks for your clarification, see my comments below. On Thu, 22 Sep 2016, gordon chung wrote: On 22/09/2016 2:40 AM, Zhai, Edwin wrote: See https://github.com/openstack/aodh/blob/master/aodh/evaluator/event.py#L158 evaluate_events is the handler of the endpoint for 'alarm.all', it iterates the event list and evaluate them one by one with project alarms. If both 'timeout.end' and 'X' are in the event list, I assume they are handled in sequence at different iterations of for loop. Am I right? not exactly. the code above is actually an endpoint for event listener. the event listener itself is threaded so in theory, we have 64 of these endpoints/loops. you can override the threads to have just one but that's where things slow down a lot. we handle this in ceilometer by having many single thread listeners each handling it's own queue[1]. i still need to publish diagram on how that works. [1] https://github.com/openstack/ceilometer/blob/master/ceilometer/notification.py#L308 There are many targets(topics)/endpoints in above ceilometer code. But in AODH, we just have one topic, 'alarm.all', and one endpoint. If it is still multi-threaded, there is already potential race condition here, but event-alarm tiemout make it worse. https://github.com/openstack/aodh/blob/master/aodh/event.py#L61-L63 deleted your sequence diagram since it's malformed in my response but that is pretty cool. a few questions: - when alarm creation event arrives at evaluator it creates a thread to process alarm. this thread will timeout and raise a new event if it doesn't receive event in time? i don't understand why we need a timeout.end event? can the evaluator not just update_alarm and notify if we timeout? or update_alarm and skip notify if we receive event on time? event evaluator is triggered by event only, that is, it's not called at all until next event comes. If no event comes, evaluator just sleeps so that can't check timeout and update_alarm. In other words, 'timeout.end' is just for waking up evaluator. cheers, -- gord Best Rgds, Edwin Best Rgds, Edwin__ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [AODH] event-alarm timeout discussion
Thanks for your clarification, see my comments below. On Thu, 22 Sep 2016, gordon chung wrote: On 22/09/2016 2:40 AM, Zhai, Edwin wrote: See https://github.com/openstack/aodh/blob/master/aodh/evaluator/event.py#L158 evaluate_events is the handler of the endpoint for 'alarm.all', it iterates the event list and evaluate them one by one with project alarms. If both 'timeout.end' and 'X' are in the event list, I assume they are handled in sequence at different iterations of for loop. Am I right? not exactly. the code above is actually an endpoint for event listener. the event listener itself is threaded so in theory, we have 64 of these endpoints/loops. you can override the threads to have just one but that's where things slow down a lot. we handle this in ceilometer by having many single thread listeners each handling it's own queue[1]. i still need to publish diagram on how that works. [1] https://github.com/openstack/ceilometer/blob/master/ceilometer/notification.py#L308 There are many targets(topics)/endpoints in above ceilometer code. But in AODH, we just have one topic, 'alarm.all', and one endpoint. If it is still multi-threaded, there is already potential race condition here, but event-alarm tiemout make it worse. https://github.com/openstack/aodh/blob/master/aodh/event.py#L61-L63 deleted your sequence diagram since it's malformed in my response but that is pretty cool. a few questions: - when alarm creation event arrives at evaluator it creates a thread to process alarm. this thread will timeout and raise a new event if it doesn't receive event in time? i don't understand why we need a timeout.end event? can the evaluator not just update_alarm and notify if we timeout? or update_alarm and skip notify if we receive event on time? event evaluator is triggered by event only, that is, it's not called at all until next event comes. If no event comes, evaluator just sleeps so that can't check timeout and update_alarm. In other words, 'timeout.end' is just for waking up evaluator. cheers, -- gord Best Rgds, Edwin __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [AODH] event-alarm timeout discussion
On 22/09/2016 2:40 AM, Zhai, Edwin wrote: > > See > https://github.com/openstack/aodh/blob/master/aodh/evaluator/event.py#L158 > > evaluate_events is the handler of the endpoint for 'alarm.all', it > iterates the event list and evaluate them one by one with project > alarms. If both 'timeout.end' and 'X' are in the event list, I assume > they are handled in sequence at different iterations of for loop. Am I > right? not exactly. the code above is actually an endpoint for event listener. the event listener itself is threaded so in theory, we have 64 of these endpoints/loops. you can override the threads to have just one but that's where things slow down a lot. we handle this in ceilometer by having many single thread listeners each handling it's own queue[1]. i still need to publish diagram on how that works. [1] https://github.com/openstack/ceilometer/blob/master/ceilometer/notification.py#L308 > > If we have evaluate_timeout_events as handler of another endpoint for > 'alarm.timeout', then 2 handlers can run concurrently to lead race > condition. I'm not familiar with underline oslo notifications, and think > separated queue is different story. Pls. correct me if I'm wrong. > deleted your sequence diagram since it's malformed in my response but that is pretty cool. a few questions: - when alarm creation event arrives at evaluator it creates a thread to process alarm. this thread will timeout and raise a new event if it doesn't receive event in time? i don't understand why we need a timeout.end event? can the evaluator not just update_alarm and notify if we timeout? or update_alarm and skip notify if we receive event on time? cheers, -- gord __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [AODH] event-alarm timeout discussion
Gordon, Thanks for your comments. Pls. check my answer and flow chart below. On Wed, 21 Sep 2016, gordon chung wrote: === event-alarm timeout implementation = As it's for event-alarm, we need keep it as event-driven. Furthermore, for quick response, we need use event for timeout handling. Periodic worker can't meet real time requirement. Separated queue for 'alarm.timeout.end'(indicates timeout expire) leads tricky race condition. e.g. 'XYZ.done' comes in queue1, and 'alarm.timeout.end' comes in queue2, so that they are handled in parallel way: 1. In queue1, 'XYZ.done' is checking against alarm(current UNKNOWN), and will be set ALARM in next step. 2. In queue2, 'alarm.timeout.end' is checking against same alarm(current UNKNOWN), and will be set to OK(UNALARM) in next step. 3. In qeueu1, alarm transition happen: UNKNOWN => ALARM 4. In queue2, another alarm transition happen: ALARM =>OK(UNALARM) can you clarify how this work? after user creates event timeout alarm definition through API (i assume the alarm definition specify we should see event x within y seconds). - how does the evaluator get this alarm definition? is there an alarm.timeout.start message? Yes. - what is this UNALARM state? to be honest, that isn't a real word so i don't know what it's suppose to represent here. It's OK - mean we have enough data to say: not trigger this alarm. Somebody mistaken it by ALARM, so I mark it as UN-ALARM. biggest problem for me is the only thing i know is there's a alarm.timeout.end event that needs to be handled by evaluator. i don't know where it's coming from or what it's needed for. I attached flow chart at bottom . pls. check it. 'timeout.end' and event 'X' comes in different ways, it's good if evaluator do not touch next one until previous one was handled. So this alarm has bogus transition: UNKNOWN=>ALARM=>UNALARM, and tells the user: required event came, then no required event came; If put all events in one queue, evaluator handles them one by one(low level oslo mesg should be multi-threaded) so that second event would see alarm state as not UNKNOWN, and give up its transition. As Gordc said, it's slow. But only very small part of the event-alarm need timeout handling, as it's only for telco usage model. so the multithreaded part is what i was talking about. it's not handling them one by one. it's handling 64 (or whatever the default is) at any given time. whether its' one queue or two, you have a race to handle. See https://github.com/openstack/aodh/blob/master/aodh/evaluator/event.py#L158 evaluate_events is the handler of the endpoint for 'alarm.all', it iterates the event list and evaluate them one by one with project alarms. If both 'timeout.end' and 'X' are in the event list, I assume they are handled in sequence at different iterations of for loop. Am I right? If we have evaluate_timeout_events as handler of another endpoint for 'alarm.timeout', then 2 handlers can run concurrently to lead race condition. I'm not familiar with underline oslo notifications, and think separated queue is different story. Pls. correct me if I'm wrong. for e in events: try: event = Event(e) .. for id, alarm in six.iteritems( self._get_project_alarms(event.project)): try: self._evaluate_alarm(alarm, event) ... +--+ ++ +--+ ++ +---+ ++ | User| | API server | | Notification bus | | Evaluator | | Threads | | Alarm state| +--+---+ +-+--+ ++-+ +-+--+ ++--+ +--+-+ | | | | || +---> | | | || | +-+ | | | || | |Alarm create | | | | |+---+ | |event: X | | | | || UNKNOWN | | |timeout: 5s | | | | |+---+ | +-+ | | | || | +-> | || | | +-+ | | || | | |Event sent: | | | || |
Re: [openstack-dev] [AODH] event-alarm timeout discussion
On 21/09/16 01:43 AM, Zhai, Edwin wrote: > All, > > I'd like make some clarification for the event-alarm timeout design as > many of you have some misunderstanding here. Pls. correct me if any > mistakes. > > I realized that there are 2 different things, but we mix them sometime: > 1. event-timeout-alarm > This is one new type of alarm that bracket *.start and *.end events and > get alarmed when receive *.start but no *.end in timeout. This new alarm > handles one type of events/actions, e.g. create one alarm for instance > creation, then all instances created in future will be handled by this > alarm. This is not for real time, so it's acceptable that user know one > instance creation failure in 5 mins. > > This new type of alarm can be implemented by one worker to check the DB > periodically to do the statistic work. That is, new evaluator works in > 'polling' mode, something like threshold alarm evaluator. > > One BP is @ > https://review.openstack.org/#/c/199005/ we should probably disregard this bp since it was assumed you guys talked over it. i'm abandoning it as i think we just forgot about it. > > 2. event-alarm timeout > This is one new feature for _existed_ event-alarm evaluator. One alarm > becomes 'UNALARM' when not receive desire event in timeout. This feature > just handles one specific event, e.g create one alarm for instance ABC's > XYZ operation with 5s, then user is notified in 5s immediately if no > XYZ.done event comes. If want check for another instance, we need create > another alarm. > > This is used in telco scenario, where operator want know if operation > failure in real time. > > My patch(https://review.openstack.org/#/c/272028/) is for this purpose > only, but I feel many guys mistaken them(sometimes even me) as they > looks similar. So my question is: Do you think this telco usage model of > event-alarm timeout is valid? If not, we can avoid discussing its > implementation and ignore following. > > > === event-alarm timeout implementation = > As it's for event-alarm, we need keep it as event-driven. Furthermore, > for quick response, we need use event for timeout handling. Periodic > worker can't meet real time requirement. > > Separated queue for 'alarm.timeout.end'(indicates timeout expire) leads > tricky race condition. e.g. 'XYZ.done' comes in queue1, and > 'alarm.timeout.end' comes in queue2, so that they are handled in > parallel way: > > 1. In queue1, 'XYZ.done' is checking against alarm(current UNKNOWN), and > will be set ALARM in next step. > 2. In queue2, 'alarm.timeout.end' is checking against same alarm(current > UNKNOWN), and will be set to OK(UNALARM) in next step. > 3. In qeueu1, alarm transition happen: UNKNOWN => ALARM > 4. In queue2, another alarm transition happen: ALARM =>OK(UNALARM) > can you clarify how this work? after user creates event timeout alarm definition through API (i assume the alarm definition specify we should see event x within y seconds). - how does the evaluator get this alarm definition? is there an alarm.timeout.start message? - what is this UNALARM state? to be honest, that isn't a real word so i don't know what it's suppose to represent here. biggest problem for me is the only thing i know is there's a alarm.timeout.end event that needs to be handled by evaluator. i don't know where it's coming from or what it's needed for. > So this alarm has bogus transition: UNKNOWN=>ALARM=>UNALARM, and tells > the user: required event came, then no required event came; > > If put all events in one queue, evaluator handles them one by one(low > level oslo mesg should be multi-threaded) so that second event would see > alarm state as not UNKNOWN, and give up its transition. As Gordc said, > it's slow. But only very small part of the event-alarm need timeout > handling, as it's only for telco usage model. so the multithreaded part is what i was talking about. it's not handling them one by one. it's handling 64 (or whatever the default is) at any given time. whether its' one queue or two, you have a race to handle. > > One possible improvement as JD pointed out is to avoid so many spawned > thread. We can just create one thread inside evaluator, and ask this > thread handle all timeout requests from evaluator. Is it acceptable for > event-alarm timeout solution? > > > Best Rgds, > Edwin -- gord __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
[openstack-dev] [AODH] event-alarm timeout discussion
All, I'd like make some clarification for the event-alarm timeout design as many of you have some misunderstanding here. Pls. correct me if any mistakes. I realized that there are 2 different things, but we mix them sometime: 1. event-timeout-alarm This is one new type of alarm that bracket *.start and *.end events and get alarmed when receive *.start but no *.end in timeout. This new alarm handles one type of events/actions, e.g. create one alarm for instance creation, then all instances created in future will be handled by this alarm. This is not for real time, so it's acceptable that user know one instance creation failure in 5 mins. This new type of alarm can be implemented by one worker to check the DB periodically to do the statistic work. That is, new evaluator works in 'polling' mode, something like threshold alarm evaluator. One BP is @ https://review.openstack.org/#/c/199005/ 2. event-alarm timeout This is one new feature for _existed_ event-alarm evaluator. One alarm becomes 'UNALARM' when not receive desire event in timeout. This feature just handles one specific event, e.g create one alarm for instance ABC's XYZ operation with 5s, then user is notified in 5s immediately if no XYZ.done event comes. If want check for another instance, we need create another alarm. This is used in telco scenario, where operator want know if operation failure in real time. My patch(https://review.openstack.org/#/c/272028/) is for this purpose only, but I feel many guys mistaken them(sometimes even me) as they looks similar. So my question is: Do you think this telco usage model of event-alarm timeout is valid? If not, we can avoid discussing its implementation and ignore following. === event-alarm timeout implementation = As it's for event-alarm, we need keep it as event-driven. Furthermore, for quick response, we need use event for timeout handling. Periodic worker can't meet real time requirement. Separated queue for 'alarm.timeout.end'(indicates timeout expire) leads tricky race condition. e.g. 'XYZ.done' comes in queue1, and 'alarm.timeout.end' comes in queue2, so that they are handled in parallel way: 1. In queue1, 'XYZ.done' is checking against alarm(current UNKNOWN), and will be set ALARM in next step. 2. In queue2, 'alarm.timeout.end' is checking against same alarm(current UNKNOWN), and will be set to OK(UNALARM) in next step. 3. In qeueu1, alarm transition happen: UNKNOWN => ALARM 4. In queue2, another alarm transition happen: ALARM =>OK(UNALARM) So this alarm has bogus transition: UNKNOWN=>ALARM=>UNALARM, and tells the user: required event came, then no required event came; If put all events in one queue, evaluator handles them one by one(low level oslo mesg should be multi-threaded) so that second event would see alarm state as not UNKNOWN, and give up its transition. As Gordc said, it's slow. But only very small part of the event-alarm need timeout handling, as it's only for telco usage model. One possible improvement as JD pointed out is to avoid so many spawned thread. We can just create one thread inside evaluator, and ask this thread handle all timeout requests from evaluator. Is it acceptable for event-alarm timeout solution? Best Rgds, Edwin __ OpenStack Development Mailing List (not for usage questions) Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev