Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-12-01 Thread Clint Byrum
Excerpts from Zane Bitter's message of 2014-12-01 13:05:42 -0800:
> On 13/11/14 13:59, Clint Byrum wrote:
> > I'm not sure we have the same understanding of AMQP, so hopefully we can
> > clarify here. This stackoverflow answer echoes my understanding:
> >
> > http://stackoverflow.com/questions/17841843/rabbitmq-does-one-consumer-block-the-other-consumers-of-the-same-queue
> >
> > Not ack'ing just means they might get retransmitted if we never ack. It
> > doesn't block other consumers. And as the link above quotes from the
> > AMQP spec, when there are multiple consumers, FIFO is not guaranteed.
> > Other consumers get other messages.
> 
> Thanks, obviously my recollection of how AMQP works was coloured too 
> much by oslo.messaging.
> 
> > So just add the ability for a consumer to read, work, ack to
> > oslo.messaging, and this is mostly handled via AMQP. Of course that
> > also likely means no zeromq for Heat without accepting that messages
> > may be lost if workers die.
> >
> > Basically we need to add something that is not "RPC" but instead
> > "jobqueue" that mimics this:
> >
> > http://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messaging/rpc/dispatcher.py#n131
> >
> > I've always been suspicious of this bit of code, as it basically means
> > that if anything fails between that call, and the one below it, we have
> > lost contact, but as long as clients are written to re-send when there
> > is a lack of reply, there shouldn't be a problem. But, for a job queue,
> > there is no reply, and so the worker would dispatch, and then
> > acknowledge after the dispatched call had returned (including having
> > completed the step where new messages are added to the queue for any
> > newly-possible children).
> 
> I'm curious how people are deploying Rabbit at the moment. Are they 
> setting up multiple brokers and writing messages to disk before 
> accepting them? I assume yes on the former but no on the latter, since 
> there's no particular point in having e.g. 5 nines durability in the 
> queue when the overall system is as weak as your flakiest node.
> 

Usually the pseudo-code should be:

msg = queue.read()
do_something_idempotent_with(msg.payload)
msg.ack()

The idea is to ack only after you've done _everything_ with the payload,
but to not freak out if somebody already did _some_ of what you did with
the payload.
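
For concreteness, here's a minimal sketch of that read/work/ack pattern using
kombu's SimpleQueue (kombu comes up later in this thread). The broker URL, queue
name, and handle_job() are made up for illustration; the real thing would go
through oslo.messaging rather than kombu directly.

# Sketch only: read a message, do the idempotent work, ack afterwards.
from kombu import Connection

def handle_job(payload):
    # Must be idempotent: a redelivered message may run this twice.
    print("processing", payload)

with Connection('amqp://guest:guest@localhost:5672//') as conn:
    queue = conn.SimpleQueue('heat-jobs')
    while True:
        message = queue.get(block=True)   # received, but not acked yet
        handle_job(message.payload)       # do _everything_ first
        message.ack()                     # only now is it dropped from the queue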

> OTOH if we were to add what you're proposing, then we would need folks 
> to deploy Rabbit that way (at least for Heat), since waiting for Acks on 
> receipt is insufficient to make messaging reliable if the broker can 
> easily outright lose the message.
> 

If you ask RabbitMQ to make a message durable, it writes it to durable queue
storage. If your broker is in a cluster with mirrored queues, it makes sure
it's written into _many_ queue storages.

Currently if you deploy TripleO w/ 3 controllers, you get a clustered
RabbitMQ and sufficient durability for the pattern I cited. Users may
not be deploying this way, but they should be.

I'm sort of assuming qpid's clustering works the same. 0mq will likely
not work at all for this. Other options are feasible too, like a simple
redis queue that you abuse as a job queue.
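
To illustrate what asking for durability looks like from the client side, here
is a hedged kombu sketch (queue name, payload, and broker URL are invented).
Mirroring the queue's contents across the cluster's nodes is configured
broker-side with an HA policy (ha-mode: all), not in client code.

# Sketch only: a durable queue plus persistent (disk-backed) messages.
from kombu import Connection, Exchange, Queue

exchange = Exchange('heat-jobs', type='direct', durable=True)
queue = Queue('heat-jobs', exchange=exchange, routing_key='heat-jobs',
              durable=True)

with Connection('amqp://guest:guest@localhost:5672//') as conn:
    producer = conn.Producer()
    producer.publish(
        {'action': 'check_timeout', 'stack_id': 'abc123'},
        exchange=exchange,
        routing_key='heat-jobs',
        declare=[queue],
        delivery_mode=2,   # persistent: written to the broker's disk
    )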

> I think all of the proposed approaches would benefit from this feature, 
> but I'm concerned about any increased burden on deployers too.

Right now they have the burden of supporting coarse timeouts, which seem likely
to fail often. That seems worse in my head.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-12-01 Thread Zane Bitter

On 13/11/14 13:59, Clint Byrum wrote:

I'm not sure we have the same understanding of AMQP, so hopefully we can
clarify here. This stackoverflow answer echoes my understanding:

http://stackoverflow.com/questions/17841843/rabbitmq-does-one-consumer-block-the-other-consumers-of-the-same-queue

Not ack'ing just means they might get retransmitted if we never ack. It
doesn't block other consumers. And as the link above quotes from the
AMQP spec, when there are multiple consumers, FIFO is not guaranteed.
Other consumers get other messages.


Thanks, obviously my recollection of how AMQP works was coloured too 
much by oslo.messaging.



So just add the ability for a consumer to read, work, ack to
oslo.messaging, and this is mostly handled via AMQP. Of course that
also likely means no zeromq for Heat without accepting that messages
may be lost if workers die.

Basically we need to add something that is not "RPC" but instead
"jobqueue" that mimics this:

http://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messaging/rpc/dispatcher.py#n131

I've always been suspicious of this bit of code, as it basically means
that if anything fails between that call, and the one below it, we have
lost contact, but as long as clients are written to re-send when there
is a lack of reply, there shouldn't be a problem. But, for a job queue,
there is no reply, and so the worker would dispatch, and then
acknowledge after the dispatched call had returned (including having
completed the step where new messages are added to the queue for any
newly-possible children).


I'm curious how people are deploying Rabbit at the moment. Are they 
setting up multiple brokers and writing messages to disk before 
accepting them? I assume yes on the former but no on the latter, since 
there's no particular point in having e.g. 5 nines durability in the 
queue when the overall system is as weak as your flakiest node.


OTOH if we were to add what you're proposing, then we would need folks 
to deploy Rabbit that way (at least for Heat), since waiting for Acks on 
receipt is insufficient to make messaging reliable if the broker can 
easily outright lose the message.


I think all of the proposed approaches would benefit from this feature, 
but I'm concerned about any increased burden on deployers too.


cheers,
Zane.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-14 Thread Joshua Harlow
Arg, sorry for the spam, mail.app was still trying to send it multiple times 
for some reason...
-Josh
 
 From: Joshua Harlow 
 To: OpenStack Development Mailing List (not for usage questions) 
 
 Sent: Friday, November 14, 2014 11:45 AM
 Subject: Re: [openstack-dev] [Heat] Using Job Queues for timeout ops
   
Sounds like this is tooz[1] ;)

The API for tooz (soon to be an oslo library @ 
https://review.openstack.org/#/c/122439/) is around coordination and 
'service-group'-like behavior, so I hope we don't end up with a duplicate of 
this in 'oslo.healthcheck' instead of just using/contributing to tooz.

https://github.com/stackforge/tooz/blob/master/tooz/coordination.py#L63

CoordinationDriver
- watch_join_group
- unwatch_join_group
- join_group
- get_members
- ...

Tooz has backends that use [redis, zookeeper, memcache] to achieve the above 
API (it also has support for distributed locks).
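
(For anyone who hasn't looked at tooz yet, here's a rough sketch of the
group-membership calls above; the backend URL, group, and member names are made
up, and some backends need a periodic heartbeat() call that is omitted here.)

from tooz import coordination

coordinator = coordination.get_coordinator('zookeeper://localhost:2181',
                                            b'engine-1')
coordinator.start()

group = b'heat-engines'
try:
    coordinator.create_group(group).get()
except coordination.GroupAlreadyExist:
    pass
coordinator.join_group(group).get()

# Every member that joined (and is still alive) shows up here.
print(coordinator.get_members(group).get())

coordinator.leave_group(group).get()
coordinator.stop()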

Feel free to jump on #openstack-state-management if you want more info (jd, the 
eNovance folks, and I developed that library for this kind of purpose).

-josh



On Nov 13, 2014, at 10:58 PM, Jastrzebski, Michal 
 wrote:

> Also, on "Common approach to HA" session we moved something like 
> oslo.healthcheck (or whatever it will be called), common lib for 
> service-group like behavior. In my opinion it's pointless to implement 
> zookeeper management in every project separately (its already in nova..). 
> Might be worth looking closely into this topic.


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


   ___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Jastrzebski, Michal


> -Original Message-
> From: Joshua Harlow [mailto:harlo...@outlook.com]
> Sent: Thursday, November 13, 2014 10:50 PM
> To: OpenStack Development Mailing List (not for usage questions)
> Subject: Re: [openstack-dev] [Heat] Using Job Queues for timeout ops
> 
> On Nov 13, 2014, at 10:59 AM, Clint Byrum  wrote:
> 
> > Excerpts from Zane Bitter's message of 2014-11-13 09:55:43 -0800:
> >> On 13/11/14 09:58, Clint Byrum wrote:
> >>> Excerpts from Zane Bitter's message of 2014-11-13 05:54:03 -0800:
> >>>> On 13/11/14 03:29, Murugan, Visnusaran wrote:
> >>>>> Hi all,
> >>>>>
> >>>>> Convergence-POC distributes stack operations by sending resource
> >>>>> actions over RPC for any heat-engine to execute. Entire stack
> >>>>> lifecycle will be controlled by worker/observer notifications.
> >>>>> This distributed model has its own advantages and disadvantages.
> >>>>>
> >>>>> Any stack operation has a timeout and a single engine will be
> >>>>> responsible for it. If that engine goes down, timeout is lost
> >>>>> along with it. So a traditional way is for other engines to
> >>>>> recreate timeout from scratch. Also a missed resource action
> >>>>> notification will be detected only when stack operation timeout
> happens.
> >>>>>
> >>>>> To overcome this, we will need the following capability:
> >>>>>
> >>>>> 1.Resource timeout (can be used for retry)
> >>>>
> >>>> I don't believe this is strictly needed for phase 1 (essentially we
> >>>> don't have it now, so nothing gets worse).
> >>>>
> >>>
> >>> We do have a stack timeout, and it stands to reason that we won't
> >>> have a single box with a timeout greenthread after this, so a
> >>> strategy is needed.
> >>
> >> Right, that was 2, but I was talking specifically about the resource
> >> retry. I think we agree on both points.
> >>
> >>>> For phase 2, yes, we'll want it. One thing we haven't discussed
> >>>> much is that if we used Zaqar for this then the observer could
> >>>> claim a message but not acknowledge it until it had processed it,
> >>>> so we could have guaranteed delivery.
> >>>>
> >>>
> >>> Frankly, if oslo.messaging doesn't support reliable delivery then we
> >>> need to add it.
> >>
> >> That is straight-up impossible with AMQP. Either you ack the message
> >> and risk losing it if the worker dies before processing is complete,
> >> or you don't ack the message until it's processed and you become a
> >> blocker for every other worker trying to pull jobs off the queue. It
> >> works fine when you have only one worker; otherwise not so much. This
> >> is the crux of the whole "why isn't Zaqar just Rabbit" debate.
> >>
> >
> > I'm not sure we have the same understanding of AMQP, so hopefully we
> > can clarify here. This stackoverflow answer echoes my understanding:
> >
> > http://stackoverflow.com/questions/17841843/rabbitmq-does-one-
> consumer
> > -block-the-other-consumers-of-the-same-queue
> >
> > Not ack'ing just means they might get retransmitted if we never ack.
> > It doesn't block other consumers. And as the link above quotes from
> > the AMQP spec, when there are multiple consumers, FIFO is not
> guaranteed.
> > Other consumers get other messages.
> >
> > So just add the ability for a consumer to read, work, ack to
> > oslo.messaging, and this is mostly handled via AMQP. Of course that
> > also likely means no zeromq for Heat without accepting that messages
> > may be lost if workers die.
> >
> > Basically we need to add something that is not "RPC" but instead
> > "jobqueue" that mimics this:
> >
> > http://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messa
> > ging/rpc/dispatcher.py#n131
> >
> > I've always been suspicious of this bit of code, as it basically means
> > that if anything fails between that call, and the one below it, we
> > have lost contact, but as long as clients are written to re-send when
> > there is a lack of reply, there shouldn't be a problem. But, for a job
> > queue, there is no reply, and so the worker would dispatch, and then
> > 

Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Jastrzebski, Michal


> -Original Message-
> From: Clint Byrum [mailto:cl...@fewbar.com]
> Sent: Thursday, November 13, 2014 8:00 PM
> To: openstack-dev
> Subject: Re: [openstack-dev] [Heat] Using Job Queues for timeout ops
> 
> Excerpts from Zane Bitter's message of 2014-11-13 09:55:43 -0800:
> > On 13/11/14 09:58, Clint Byrum wrote:
> > > Excerpts from Zane Bitter's message of 2014-11-13 05:54:03 -0800:
> > >> On 13/11/14 03:29, Murugan, Visnusaran wrote:
> > >>> Hi all,
> > >>>
> > >>> Convergence-POC distributes stack operations by sending resource
> > >>> actions over RPC for any heat-engine to execute. Entire stack
> > >>> lifecycle will be controlled by worker/observer notifications.
> > >>> This distributed model has its own advantages and disadvantages.
> > >>>
> > >>> Any stack operation has a timeout and a single engine will be
> > >>> responsible for it. If that engine goes down, timeout is lost
> > >>> along with it. So a traditional way is for other engines to
> > >>> recreate timeout from scratch. Also a missed resource action
> > >>> notification will be detected only when stack operation timeout
> happens.
> > >>>
> > >>> To overcome this, we will need the following capability:
> > >>>
> > >>> 1.Resource timeout (can be used for retry)
> > >>
> > >> I don't believe this is strictly needed for phase 1 (essentially we
> > >> don't have it now, so nothing gets worse).
> > >>
> > >
> > > We do have a stack timeout, and it stands to reason that we won't
> > > have a single box with a timeout greenthread after this, so a
> > > strategy is needed.
> >
> > Right, that was 2, but I was talking specifically about the resource
> > retry. I think we agree on both points.
> >
> > >> For phase 2, yes, we'll want it. One thing we haven't discussed
> > >> much is that if we used Zaqar for this then the observer could
> > >> claim a message but not acknowledge it until it had processed it,
> > >> so we could have guaranteed delivery.
> > >>
> > >
> > > Frankly, if oslo.messaging doesn't support reliable delivery then we
> > > need to add it.
> >
> > That is straight-up impossible with AMQP. Either you ack the message
> > and risk losing it if the worker dies before processing is complete,
> > or you don't ack the message until it's processed and you become a
> > blocker for every other worker trying to pull jobs off the queue. It
> > works fine when you have only one worker; otherwise not so much. This
> > is the crux of the whole "why isn't Zaqar just Rabbit" debate.
> >
> 
> I'm not sure we have the same understanding of AMQP, so hopefully we can
> clarify here. This stackoverflow answer echoes my understanding:
> 
> http://stackoverflow.com/questions/17841843/rabbitmq-does-one-
> consumer-block-the-other-consumers-of-the-same-queue
> 
> Not ack'ing just means they might get retransmitted if we never ack. It
> doesn't block other consumers. And as the link above quotes from the
> AMQP spec, when there are multiple consumers, FIFO is not guaranteed.
> Other consumers get other messages.
> 
> So just add the ability for a consumer to read, work, ack to oslo.messaging,
> and this is mostly handled via AMQP. Of course that also likely means no
> zeromq for Heat without accepting that messages may be lost if workers die.
> 
> Basically we need to add something that is not "RPC" but instead "jobqueue"
> that mimics this:
> 
> http://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messagin
> g/rpc/dispatcher.py#n131
> 
> I've always been suspicious of this bit of code, as it basically means that if
> anything fails between that call, and the one below it, we have lost contact,
> but as long as clients are written to re-send when there is a lack of reply,
> there shouldn't be a problem. But, for a job queue, there is no reply, and so
> the worker would dispatch, and then acknowledge after the dispatched call
> had returned (including having completed the step where new messages are
> added to the queue for any newly-possible children).
> 
> Just to be clear, I believe what Zaqar adds is the ability to peek at a 
> specific
> message ID and not affect it in the queue, which is entirely different than
> ACK'ing the ones you've already received in your session.
> 

Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Joshua Harlow
On Nov 13, 2014, at 4:08 PM, Clint Byrum  wrote:

> Excerpts from Joshua Harlow's message of 2014-11-13 14:01:14 -0800:
>> On Nov 13, 2014, at 7:10 AM, Clint Byrum  wrote:
>> 
>>> Excerpts from Joshua Harlow's message of 2014-11-13 00:45:07 -0800:
 A question;
 
 How is using something like celery in heat vs taskflow in heat (or at 
 least concept [1]) 'to many code change'.
 
 Both seem like change of similar levels ;-)
 
>>> 
>>> I've tried a few times to dive into refactoring some things to use
>>> TaskFlow at a shallow level, and have always gotten confused and
>>> frustrated.
>>> 
>>> The amount of lines that are changed probably is the same. But the
>>> massive shift in thinking is not an easy one to make. It may be worth some
>>> thinking on providing a shorter bridge to TaskFlow adoption, because I'm
>>> a huge fan of the idea and would _start_ something with it in a heartbeat,
>>> but refactoring things to use it feels really weird to me.
>> 
>> I wonder how I can make that better...
>> 
>> Were the concepts that new/different? Maybe I just have more of a 
>> functional programming background and the way taskflow gets you to create 
>> tasks that are later executed, order them ahead of time, and then *later* 
>> run them is still a foreign concept for folks that have not done things with 
>> non-procedural languages. What were the confusion points, how may I help 
>> address them? More docs maybe, more examples, something else?
> 
> My feeling is that it is hard to let go of the language constructs that
> _seem_ to solve the problems TaskFlow does, even though in fact they are
> the problem because they're using the stack for control-flow where we
> want that control-flow to yield to TaskFlow.
> 

U know u want to let go!

> I also kind of feel like the Twisted folks answered a similar question
> with inline callbacks and made things "easier" but more complex in
> doing so. If I had a good answer I would give it to you though. :)
> 
>> 
>> I would agree that the jobboard[0] concept is different than the other parts 
>> of taskflow, but it could be useful here:
>> 
>> Basically at its core it's an application of zookeeper where 'jobs' are posted 
>> to a directory (using sequenced nodes in zookeeper, so that ordering is 
>> retained). Entities then acquire ephemeral locks on those 'jobs' (these 
>> locks will be released if the owner process disconnects, or fails...) and 
>> then work on the contents of that job (where contents can be pretty much 
>> arbitrary). This creates a highly available job queue (queue-like due to the 
>> node sequencing[1]), and it sounds pretty similar to what zaqar could 
>> provide in theory (except the zookeeper one is proven, battle-hardened, 
>> works and exists...). But we should of course continue being scared of 
>> zookeeper, because u know, who wants to use a tool where it would fit, haha 
>> (this is a joke).
>> 
> 
> So ordering is a distraction from the task at hand. But the locks that
> indicate liveness of the workers is very interesting to me. Since we
> don't actually have requirements of ordering on the front-end of the task
> (we do on the completion of certain tasks, but we can use a DB for that),
> I wonder if we can just get the same effect with a durable queue that uses
> a reliable messaging pattern where we don't ack until we're done. That
> would achieve the goal of liveness.
> 

Possibly, it depends on what the message broker is doing with the message when 
the message hasn't been acked. With zookeeper being used as a queue of jobs, 
the job actually has an owner (the thing with the ephemeral lock on the job) so 
the job won't get 'taken over' by someone else unless that ephemeral lock drops 
off (due to owner dying or disconnecting...); this is where I'm not sure what 
message brokers do (varies by message broker?).

A little example taskflow program that I made that you can also run (replace my 
zookeeper server with your own).

http://paste.ubuntu.com/8995861/

You can then run like:

$ python jb.py  'producer'

And for a worker (start many of these if u want),

$ python jb.py 'c1'

Then you can see the work being produced/consumed, and u can ctrl-c 'c1' and 
then another worker will take over the work...

Something like the following should be output (by workers):

$ python jb.py 'c2'
INFO:kazoo.client:Connecting to buildingbuild.corp.yahoo.com:2181
INFO:kazoo.client:Zookeeper connection established, state: CONNECTED
Waiting for jobs to appear...
Running {u'action': u'stuff', u'id': 1}
Waiting for jobs to appear...
Running {u'action': u'stuff', u'id': 3}

For producers:

$ python jb.py  'producer'
INFO:kazoo.client:Connecting to buildingbuild.corp.yahoo.com:2181
INFO:kazoo.client:Zookeeper connection established, state: CONNECTED
Posting work item {'action': 'stuff', 'id': 0}
Posting work item {'action': 'stuff', 'id': 1}
Posting work item {'action': 'stuff', 'id': 2}
Posting work item {'action': 'stuff', 'id': 3}
P

Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Clint Byrum
Excerpts from Joshua Harlow's message of 2014-11-13 14:01:14 -0800:
> On Nov 13, 2014, at 7:10 AM, Clint Byrum  wrote:
> 
> > Excerpts from Joshua Harlow's message of 2014-11-13 00:45:07 -0800:
> >> A question;
> >> 
> >> How is using something like celery in heat vs taskflow in heat (or at 
> >> least concept [1]) 'to many code change'.
> >> 
> >> Both seem like change of similar levels ;-)
> >> 
> > 
> > I've tried a few times to dive into refactoring some things to use
> > TaskFlow at a shallow level, and have always gotten confused and
> > frustrated.
> > 
> > The amount of lines that are changed probably is the same. But the
> > massive shift in thinking is not an easy one to make. It may be worth some
> > thinking on providing a shorter bridge to TaskFlow adoption, because I'm
> > a huge fan of the idea and would _start_ something with it in a heartbeat,
> > but refactoring things to use it feels really weird to me.
> 
> I wonder how I can make that better...
> 
> Were the concepts that new/different? Maybe I just have more of a functional 
> programming background and the way taskflow gets you to create tasks that are 
> later executed, order them ahead of time, and then *later* run them is still 
> a foreign concept for folks that have not done things with non-procedural 
> languages. What were the confusion points, how may I help address them? More 
> docs maybe, more examples, something else?

My feeling is that it is hard to let go of the language constructs that
_seem_ to solve the problems TaskFlow does, even though in fact they are
the problem because they're using the stack for control-flow where we
want that control-flow to yield to TaskFlow.

I also kind of feel like the Twisted folks answered a similar question
with inline callbacks and made things "easier" but more complex in
doing so. If I had a good answer I would give it to you though. :)

> 
> I would agree that the jobboard[0] concept is different than the other parts 
> of taskflow, but it could be useful here:
> 
> Basically at its core it's an application of zookeeper where 'jobs' are posted 
> to a directory (using sequenced nodes in zookeeper, so that ordering is 
> retained). Entities then acquire ephemeral locks on those 'jobs' (these locks 
> will be released if the owner process disconnects, or fails...) and then work 
> on the contents of that job (where contents can be pretty much arbitrary). 
> This creates a highly available job queue (queue-like due to the node 
> sequencing[1]), and it sounds pretty similar to what zaqar could provide in 
> theory (except the zookeeper one is proven, battle-hardened, works and 
> exists...). But we should of course continue being scared of zookeeper, 
> because u know, who wants to use a tool where it would fit, haha (this is a 
> joke).
> 

So ordering is a distraction from the task at hand. But the locks that
indicate liveness of the workers is very interesting to me. Since we
don't actually have requirements of ordering on the front-end of the task
(we do on the completion of certain tasks, but we can use a DB for that),
I wonder if we can just get the same effect with a durable queue that uses
a reliable messaging pattern where we don't ack until we're done. That
would achieve the goal of liveness.

> [0] 
> https://github.com/openstack/taskflow/blob/master/taskflow/jobs/jobboard.py#L25
>  
> 
> [1] 
> http://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#Sequence+Nodes+--+Unique+Naming
> 
> > 
> >> What was your metric for determining the code change either would have 
> >> (out of curiosity)?
> >> 
> >> Perhaps u should look at [2], although I'm unclear on what the desired 
> >> functionality is here.
> >> 
> >> Do u want the single engine to transfer its work to another engine when it 
> >> 'goes down'? If so then the jobboard model + zookeeper inherently does this.
> >> 
> >> Or maybe u want something else? I'm probably confused because u seem to be 
> >> asking for resource timeouts + recover from engine failure (which seems 
> >> like a liveness issue and not a resource timeout one), those 2 things seem 
> >> separable.
> >> 
> > 
> > I agree with you on this. It is definitely a liveness problem. The
> > resource timeout isn't something I've seen discussed before. We do have
> > a stack timeout, and we need to keep on honoring that, but we can do
> > that with a job that sleeps for the stack timeout if we have a liveness
> > guarantee that will resurrect the job (with the sleep shortened by the
> > time since stack-update-time) somewhere else if the original engine
> > can't complete the job.
> > 
> >> [1] http://docs.openstack.org/developer/taskflow/jobs.html
> >> 
> >> [2] 
> >> http://docs.openstack.org/developer/taskflow/examples.html#jobboard-producer-consumer-simple
> >> 
> >> On Nov 13, 2014, at 12:29 AM, Murugan, Visnusaran 
> >>  wrote:
> >> 
> >>> Hi all,
> >>> 
> >>> Convergence-POC distributes stack operations by sending resource actions 
> >>> over RPC f

Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Joshua Harlow
On Nov 13, 2014, at 7:10 AM, Clint Byrum  wrote:

> Excerpts from Joshua Harlow's message of 2014-11-13 00:45:07 -0800:
>> A question;
>> 
>> How is using something like celery in heat vs taskflow in heat (or at least 
>> concept [1]) 'to many code change'.
>> 
>> Both seem like change of similar levels ;-)
>> 
> 
> I've tried a few times to dive into refactoring some things to use
> TaskFlow at a shallow level, and have always gotten confused and
> frustrated.
> 
> The amount of lines that are changed probably is the same. But the
> massive shift in thinking is not an easy one to make. It may be worth some
> thinking on providing a shorter bridge to TaskFlow adoption, because I'm
> a huge fan of the idea and would _start_ something with it in a heartbeat,
> but refactoring things to use it feels really weird to me.

I wonder how I can make that better...

Were the concepts that new/different? Maybe I just have more of a functional 
programming background, and the way taskflow gets you to create tasks that are 
later executed, order them ahead of time, and then *later* run them is still a 
foreign concept for folks that have not done things with non-procedural 
languages. What were the confusion points, and how can I help address them? More 
docs maybe, more examples, something else?
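
(For folks who haven't seen it, here's a tiny sketch of that declare-tasks-first,
run-them-later shape; the task and flow names are made up and this is not Heat
code.)

import taskflow.engines
from taskflow.patterns import linear_flow
from taskflow import task

class CreateResource(task.Task):
    def execute(self):
        print('creating resource')

class NotifyDone(task.Task):
    def execute(self):
        print('notifying observers')

# Build and order the work ahead of time...
flow = linear_flow.Flow('stack-update')
flow.add(CreateResource('create'), NotifyDone('notify'))

# ...then hand it to an engine to actually run it, possibly much later.
taskflow.engines.run(flow)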

I would agree that the jobboard[0] concept is different from the other parts of 
taskflow, but it could be useful here:

Basically, at its core it's an application of zookeeper where 'jobs' are posted 
to a directory (using sequenced nodes in zookeeper, so that ordering is 
retained). Entities then acquire ephemeral locks on those 'jobs' (these locks 
will be released if the owner process disconnects or fails...) and then work on 
the contents of that job (where the contents can be pretty much arbitrary). 
This creates a highly available job queue (queue-like due to the node 
sequencing[1]), and it sounds pretty similar to what zaqar could provide in 
theory (except the zookeeper one is proven, battle-hardened, works and 
exists...). But we should of course continue being scared of zookeeper, because 
you know, who wants to use a tool where it would fit, haha (this is a joke).

[0] 
https://github.com/openstack/taskflow/blob/master/taskflow/jobs/jobboard.py#L25 

[1] 
http://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#Sequence+Nodes+--+Unique+Naming
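
(To make that concrete, here's a rough kazoo-level illustration of the two
primitives described above, i.e. sequenced nodes for ordering and an ephemeral
lock for ownership. This is not taskflow's actual jobboard code; the paths,
payload, and worker id are made up.)

import json
from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

# Post a job: a sequenced node preserves submission order.
posted = zk.create('/jobs/job-', json.dumps({'action': 'stuff'}).encode(),
                   sequence=True, makepath=True)
print('posted', posted)

# Claim a job: the ephemeral lock vanishes if the owner dies or disconnects,
# so another worker can take the job over.
for child in sorted(zk.get_children('/jobs')):
    lock = zk.Lock('/jobs-locks/' + child, identifier='worker-1')
    if lock.acquire(blocking=False):
        data, _stat = zk.get('/jobs/' + child)
        print('working on', json.loads(data.decode()))
        zk.delete('/jobs/' + child)   # work done; remove the job
        lock.release()
        break

zk.stop()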

> 
>> What was your metric for determining the code change either would have (out 
>> of curiosity)?
>> 
>> Perhaps u should look at [2], although I'm unclear on what the desired 
>> functionality is here.
>> 
>> Do u want the single engine to transfer its work to another engine when it 
>> 'goes down'? If so then the jobboard model + zookeeper inherently does this.
>> 
>> Or maybe u want something else? I'm probably confused because u seem to be 
>> asking for resource timeouts + recover from engine failure (which seems like 
>> a liveness issue and not a resource timeout one), those 2 things seem 
>> separable.
>> 
> 
> I agree with you on this. It is definitely a liveness problem. The
> resource timeout isn't something I've seen discussed before. We do have
> a stack timeout, and we need to keep on honoring that, but we can do
> that with a job that sleeps for the stack timeout if we have a liveness
> guarantee that will resurrect the job (with the sleep shortened by the
> time since stack-update-time) somewhere else if the original engine
> can't complete the job.
> 
>> [1] http://docs.openstack.org/developer/taskflow/jobs.html
>> 
>> [2] 
>> http://docs.openstack.org/developer/taskflow/examples.html#jobboard-producer-consumer-simple
>> 
>> On Nov 13, 2014, at 12:29 AM, Murugan, Visnusaran 
>>  wrote:
>> 
>>> Hi all,
>>> 
>>> Convergence-POC distributes stack operations by sending resource actions 
>>> over RPC for any heat-engine to execute. Entire stack lifecycle will be 
>>> controlled by worker/observer notifications. This distributed model has its 
>>> own advantages and disadvantages.
>>> 
>>> Any stack operation has a timeout and a single engine will be responsible 
>>> for it. If that engine goes down, timeout is lost along with it. So a 
>>> traditional way is for other engines to recreate timeout from scratch. Also 
>>> a missed resource action notification will be detected only when stack 
>>> operation timeout happens.
>>> 
>>> To overcome this, we will need the following capability:
>>> 1.   Resource timeout (can be used for retry)
>>> 2.   Recover from engine failure (loss of stack timeout, resource 
>>> action notification)
>>> 
>>> 
>>> Suggestion:
>>> 1.   Use task queue like celery to host timeouts for both stack and 
>>> resource.
>>> 2.   Poll database for engine failures and restart timers/ retrigger 
>>> resource retry (IMHO: This would be a traditional and weighs heavy)
>>> 3.   Migrate heat to use TaskFlow. (Too many code change)
>>> 
>>> I am not suggesting we use Task Flow. Using celery will have very minimum 
>>> code change. (de

Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Joshua Harlow
On Nov 13, 2014, at 10:59 AM, Clint Byrum  wrote:

> Excerpts from Zane Bitter's message of 2014-11-13 09:55:43 -0800:
>> On 13/11/14 09:58, Clint Byrum wrote:
>>> Excerpts from Zane Bitter's message of 2014-11-13 05:54:03 -0800:
 On 13/11/14 03:29, Murugan, Visnusaran wrote:
> Hi all,
> 
> Convergence-POC distributes stack operations by sending resource actions
> over RPC for any heat-engine to execute. Entire stack lifecycle will be
> controlled by worker/observer notifications. This distributed model has
> its own advantages and disadvantages.
> 
> Any stack operation has a timeout and a single engine will be
> responsible for it. If that engine goes down, timeout is lost along with
> it. So a traditional way is for other engines to recreate timeout from
> scratch. Also a missed resource action notification will be detected
> only when stack operation timeout happens.
> 
> To overcome this, we will need the following capability:
> 
> 1.Resource timeout (can be used for retry)
 
 I don't believe this is strictly needed for phase 1 (essentially we
 don't have it now, so nothing gets worse).
 
>>> 
>>> We do have a stack timeout, and it stands to reason that we won't have a
>>> single box with a timeout greenthread after this, so a strategy is
>>> needed.
>> 
>> Right, that was 2, but I was talking specifically about the resource 
>> retry. I think we agree on both points.
>> 
 For phase 2, yes, we'll want it. One thing we haven't discussed much is
 that if we used Zaqar for this then the observer could claim a message
 but not acknowledge it until it had processed it, so we could have
 guaranteed delivery.
 
>>> 
>>> Frankly, if oslo.messaging doesn't support reliable delivery then we
>>> need to add it.
>> 
>> That is straight-up impossible with AMQP. Either you ack the message and
>> risk losing it if the worker dies before processing is complete, or you 
>> don't ack the message until it's processed and you become a blocker for 
>> every other worker trying to pull jobs off the queue. It works fine when 
>> you have only one worker; otherwise not so much. This is the crux of the 
>> whole "why isn't Zaqar just Rabbit" debate.
>> 
> 
> I'm not sure we have the same understanding of AMQP, so hopefully we can
> clarify here. This stackoverflow answer echoes my understanding:
> 
> http://stackoverflow.com/questions/17841843/rabbitmq-does-one-consumer-block-the-other-consumers-of-the-same-queue
> 
> Not ack'ing just means they might get retransmitted if we never ack. It
> doesn't block other consumers. And as the link above quotes from the
> AMQP spec, when there are multiple consumers, FIFO is not guaranteed.
> Other consumers get other messages.
> 
> So just add the ability for a consumer to read, work, ack to
> oslo.messaging, and this is mostly handled via AMQP. Of course that
> also likely means no zeromq for Heat without accepting that messages
> may be lost if workers die.
> 
> Basically we need to add something that is not "RPC" but instead
> "jobqueue" that mimics this:
> 
> http://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messaging/rpc/dispatcher.py#n131
> 
> I've always been suspicious of this bit of code, as it basically means
> that if anything fails between that call, and the one below it, we have
> lost contact, but as long as clients are written to re-send when there
> is a lack of reply, there shouldn't be a problem. But, for a job queue,
> there is no reply, and so the worker would dispatch, and then
> acknowledge after the dispatched call had returned (including having
> completed the step where new messages are added to the queue for any
> newly-possible children).
> 
> Just to be clear, I believe what Zaqar adds is the ability to peek at
> a specific message ID and not affect it in the queue, which is entirely
> different than ACK'ing the ones you've already received in your session.
> 
>> Most stuff in OpenStack gets around this by doing synchronous calls 
>> across oslo.messaging, where there is an end-to-end ack. We don't want 
>> that here though. We'll probably have to make do with having ways to 
>> recover after a failure (kick off another update with the same data is 
>> always an option). The hard part is that if something dies we don't 
>> really want to wait until the stack timeout to start recovering.
>> 
> 
> I fully agree. Josh's point about using a coordination service like
> Zookeeper to maintain liveness is an interesting one here. If we just
> make sure that all the workers that have claimed work off the queue are
> alive, that should be sufficient to prevent a hanging stack situation
> like you describe above.
> 
>>> Zaqar should have nothing to do with this and is, IMO, a
>>> poor choice at this stage, though I like the idea of using it in the
>>> future so that we can make Heat more of an outside-the-cloud app.
>> 
>> I'm inclined to agree that it 

Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Clint Byrum
Excerpts from Zane Bitter's message of 2014-11-13 09:55:43 -0800:
> On 13/11/14 09:58, Clint Byrum wrote:
> > Excerpts from Zane Bitter's message of 2014-11-13 05:54:03 -0800:
> >> On 13/11/14 03:29, Murugan, Visnusaran wrote:
> >>> Hi all,
> >>>
> >>> Convergence-POC distributes stack operations by sending resource actions
> >>> over RPC for any heat-engine to execute. Entire stack lifecycle will be
> >>> controlled by worker/observer notifications. This distributed model has
> >>> its own advantages and disadvantages.
> >>>
> >>> Any stack operation has a timeout and a single engine will be
> >>> responsible for it. If that engine goes down, timeout is lost along with
> >>> it. So a traditional way is for other engines to recreate timeout from
> >>> scratch. Also a missed resource action notification will be detected
> >>> only when stack operation timeout happens.
> >>>
> >>> To overcome this, we will need the following capability:
> >>>
> >>> 1.Resource timeout (can be used for retry)
> >>
> >> I don't believe this is strictly needed for phase 1 (essentially we
> >> don't have it now, so nothing gets worse).
> >>
> >
> > We do have a stack timeout, and it stands to reason that we won't have a
> > single box with a timeout greenthread after this, so a strategy is
> > needed.
> 
> Right, that was 2, but I was talking specifically about the resource 
> retry. I think we agree on both points.
> 
> >> For phase 2, yes, we'll want it. One thing we haven't discussed much is
> >> that if we used Zaqar for this then the observer could claim a message
> >> but not acknowledge it until it had processed it, so we could have
> >> guaranteed delivery.
> >>
> >
> > Frankly, if oslo.messaging doesn't support reliable delivery then we
> > need to add it.
> 
> That is straight-up impossible with AMQP. Either you ack the message and 
> risk losing it if the worker dies before processing is complete, or you 
> don't ack the message until it's processed and you become a blocker for 
> every other worker trying to pull jobs off the queue. It works fine when 
> you have only one worker; otherwise not so much. This is the crux of the 
> whole "why isn't Zaqar just Rabbit" debate.
> 

I'm not sure we have the same understanding of AMQP, so hopefully we can
clarify here. This stackoverflow answer echoes my understanding:

http://stackoverflow.com/questions/17841843/rabbitmq-does-one-consumer-block-the-other-consumers-of-the-same-queue

Not ack'ing just means they might get retransmitted if we never ack. It
doesn't block other consumers. And as the link above quotes from the
AMQP spec, when there are multiple consumers, FIFO is not guaranteed.
Other consumers get other messages.

So just add the ability for a consumer to read, work, ack to
oslo.messaging, and this is mostly handled via AMQP. Of course that
also likely means no zeromq for Heat without accepting that messages
may be lost if workers die.

Basically we need to add something that is not "RPC" but instead
"jobqueue" that mimics this:

http://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messaging/rpc/dispatcher.py#n131

I've always been suspicious of this bit of code, as it basically means
that if anything fails between that call, and the one below it, we have
lost contact, but as long as clients are written to re-send when there
is a lack of reply, there shouldn't be a problem. But, for a job queue,
there is no reply, and so the worker would dispatch, and then
acknowledge after the dispatched call had returned (including having
completed the step where new messages are added to the queue for any
newly-possible children).

Just to be clear, I believe what Zaqar adds is the ability to peek at
a specific message ID and not affect it in the queue, which is entirely
different than ACK'ing the ones you've already received in your session.

> Most stuff in OpenStack gets around this by doing synchronous calls 
> across oslo.messaging, where there is an end-to-end ack. We don't want 
> that here though. We'll probably have to make do with having ways to 
> recover after a failure (kick off another update with the same data is 
> always an option). The hard part is that if something dies we don't 
> really want to wait until the stack timeout to start recovering.
>

I fully agree. Josh's point about using a coordination service like
Zookeeper to maintain liveness is an interesting one here. If we just
make sure that all the workers that have claimed work off the queue are
alive, that should be sufficient to prevent a hanging stack situation
like you describe above.
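
(A hedged sketch of what that liveness check could look like with a coordination
service, using the tooz API mentioned elsewhere in this thread; the backend URL,
group, and names are illustrative, and the actual re-queueing is hand-waved.)

import time

from tooz import coordination

coordinator = coordination.get_coordinator('zookeeper://localhost:2181',
                                            b'observer-1')
coordinator.start()
group = b'heat-workers'   # assumed to already exist; workers join it on start

def on_member_left(event):
    # event.member_id is the worker that died or disconnected; this is where
    # its claimed jobs would be put back on the queue.
    print('%s went away, re-queueing its claimed work' % event.member_id)

coordinator.watch_leave_group(group, on_member_left)
while True:
    coordinator.run_watchers()
    time.sleep(1)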

> > Zaqar should have nothing to do with this and is, IMO, a
> > poor choice at this stage, though I like the idea of using it in the
> > future so that we can make Heat more of an outside-the-cloud app.
> 
> I'm inclined to agree that it would be hard to force operators to deploy 
> Zaqar in order to be able to deploy Heat, and that we should probably be 
> cautious for that reason.
> 
> That said, fr

Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Nandavar, Divakar Padiyar
> Most stuff in OpenStack gets around this by doing synchronous calls across
> oslo.messaging, where there is an end-to-end ack. We don't want that here
> though. We'll probably have to make do with having ways to recover after a
> failure (kick off another update with the same data is always an option). The
> hard part is that if something dies we don't really want to wait until the
> stack timeout to start recovering.

We should be able to address this in convergence without having to wait for the 
stack timeout. This scenario would be similar to initiating a stack update while 
another large stack update is still in progress. We are looking into addressing 
this scenario.

Thanks,
Divakar

-Original Message-
From: Zane Bitter [mailto:zbit...@redhat.com] 
Sent: Thursday, November 13, 2014 11:26 PM
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

On 13/11/14 09:58, Clint Byrum wrote:
> Excerpts from Zane Bitter's message of 2014-11-13 05:54:03 -0800:
>> On 13/11/14 03:29, Murugan, Visnusaran wrote:
>>> Hi all,
>>>
>>> Convergence-POC distributes stack operations by sending resource 
>>> actions over RPC for any heat-engine to execute. Entire stack 
>>> lifecycle will be controlled by worker/observer notifications. This 
>>> distributed model has its own advantages and disadvantages.
>>>
>>> Any stack operation has a timeout and a single engine will be 
>>> responsible for it. If that engine goes down, timeout is lost along 
>>> with it. So a traditional way is for other engines to recreate 
>>> timeout from scratch. Also a missed resource action notification 
>>> will be detected only when stack operation timeout happens.
>>>
>>> To overcome this, we will need the following capability:
>>>
>>> 1.Resource timeout (can be used for retry)
>>
>> I don't believe this is strictly needed for phase 1 (essentially we 
>> don't have it now, so nothing gets worse).
>>
>
> We do have a stack timeout, and it stands to reason that we won't have 
> a single box with a timeout greenthread after this, so a strategy is 
> needed.

Right, that was 2, but I was talking specifically about the resource retry. I 
think we agree on both points.

>> For phase 2, yes, we'll want it. One thing we haven't discussed much 
>> is that if we used Zaqar for this then the observer could claim a 
>> message but not acknowledge it until it had processed it, so we could 
>> have guaranteed delivery.
>>
>
> Frankly, if oslo.messaging doesn't support reliable delivery then we 
> need to add it.

That is straight-up impossible with AMQP. Either you ack the message and risk 
losing it if the worker dies before processing is complete, or you don't ack 
the message until it's processed and you become a blocker for every other 
worker trying to pull jobs off the queue. It works fine when you have only one 
worker; otherwise not so much. This is the crux of the whole "why isn't Zaqar 
just Rabbit" debate.

Most stuff in OpenStack gets around this by doing synchronous calls across 
oslo.messaging, where there is an end-to-end ack. We don't want that here 
though. We'll probably have to make do with having ways to recover after a 
failure (kick off another update with the same data is always an option). The 
hard part is that if something dies we don't really want to wait until the 
stack timeout to start recovering.



> Zaqar should have nothing to do with this and is, IMO, a poor choice 
> at this stage, though I like the idea of using it in the future so 
> that we can make Heat more of an outside-the-cloud app.

I'm inclined to agree that it would be hard to force operators to deploy Zaqar 
in order to be able to deploy Heat, and that we should probably be cautious for 
that reason.

That said, from a purely technical point of view it's not a poor choice at all 
- it has *exactly* the semantics we want (unlike AMQP), and at least to the 
extent that the operator wants to offer Zaqar to users anyway it completely 
eliminates a whole backend that they would otherwise have to deploy. It's a 
tragedy that all of OpenStack has not been designed to build upon itself in 
this way and it causes me physical pain to know that we're about to perpetuate 
it.

>>> 2.Recover from engine failure (loss of stack timeout, resource 
>>> action
>>> notification)
>>>
>>> Suggestion:
>>>
>>> 1.Use task queue like celery to host timeouts for both stack and resource.
>>
>> I believe Celery is more or less a non-starter as an OpenStack 
>> dependency 

Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Zane Bitter

On 13/11/14 09:58, Clint Byrum wrote:

Excerpts from Zane Bitter's message of 2014-11-13 05:54:03 -0800:

On 13/11/14 03:29, Murugan, Visnusaran wrote:

Hi all,

Convergence-POC distributes stack operations by sending resource actions
over RPC for any heat-engine to execute. Entire stack lifecycle will be
controlled by worker/observer notifications. This distributed model has
its own advantages and disadvantages.

Any stack operation has a timeout and a single engine will be
responsible for it. If that engine goes down, timeout is lost along with
it. So a traditional way is for other engines to recreate timeout from
scratch. Also a missed resource action notification will be detected
only when stack operation timeout happens.

To overcome this, we will need the following capability:

1.Resource timeout (can be used for retry)


I don't believe this is strictly needed for phase 1 (essentially we
don't have it now, so nothing gets worse).



We do have a stack timeout, and it stands to reason that we won't have a
single box with a timeout greenthread after this, so a strategy is
needed.


Right, that was 2, but I was talking specifically about the resource 
retry. I think we agree on both points.



For phase 2, yes, we'll want it. One thing we haven't discussed much is
that if we used Zaqar for this then the observer could claim a message
but not acknowledge it until it had processed it, so we could have
guaranteed delivery.



Frankly, if oslo.messaging doesn't support reliable delivery then we
need to add it.


That is straight-up impossible with AMQP. Either you ack the message and 
risk losing it if the worker dies before processing is complete, or you 
don't ack the message until it's processed and you become a blocker for 
every other worker trying to pull jobs off the queue. It works fine when 
you have only one worker; otherwise not so much. This is the crux of the 
whole "why isn't Zaqar just Rabbit" debate.


Most stuff in OpenStack gets around this by doing synchronous calls 
across oslo.messaging, where there is an end-to-end ack. We don't want 
that here though. We'll probably have to make do with having ways to 
recover after a failure (kick off another update with the same data is 
always an option). The hard part is that if something dies we don't 
really want to wait until the stack timeout to start recovering.



Zaqar should have nothing to do with this and is, IMO, a
poor choice at this stage, though I like the idea of using it in the
future so that we can make Heat more of an outside-the-cloud app.


I'm inclined to agree that it would be hard to force operators to deploy 
Zaqar in order to be able to deploy Heat, and that we should probably be 
cautious for that reason.


That said, from a purely technical point of view it's not a poor choice 
at all - it has *exactly* the semantics we want (unlike AMQP), and at 
least to the extent that the operator wants to offer Zaqar to users 
anyway it completely eliminates a whole backend that they would 
otherwise have to deploy. It's a tragedy that all of OpenStack has not 
been designed to build upon itself in this way and it causes me physical 
pain to know that we're about to perpetuate it.



2.Recover from engine failure (loss of stack timeout, resource action
notification)

Suggestion:

1.Use task queue like celery to host timeouts for both stack and resource.


I believe Celery is more or less a non-starter as an OpenStack
dependency because it uses Kombu directly to talk to the queue, vs.
oslo.messaging which is an abstraction layer over Kombu, Qpid, ZeroMQ
and maybe others in the future. i.e. requiring Celery means that some
users would be forced to install Rabbit for the first time.

One option would be to fork Celery and replace Kombu with oslo.messaging
as its abstraction layer. Good luck getting that maintained though,
since Celery _invented_ Kombu to be it's abstraction layer.



A slight side point here: Kombu supports Qpid and ZeroMQ. Oslo.messaging


You're right about Kombu supporting Qpid, it appears they added it. I 
don't see ZeroMQ on the list though:


http://kombu.readthedocs.org/en/latest/userguide/connections.html#transport-comparison


is more about having a unified API than a set of magic backends. It
actually boggles my mind why we didn't just use kombu (cue 20 reactions
with people saying it wasn't EXACTLY right), but I think we're committed


Well, we also have to take into account the fact that Qpid support was 
added only during the last 9 months, whereas oslo.messaging was 
implemented 3 years ago and time travel hasn't been invented yet (for 
any definition of 'yet').



to oslo.messaging now. Anyway, celery would need no such refactor, as
kombu would be able to access the same bus as everything else just fine.


Interesting, so that would make it easier to get Celery added to the 
global requirements, although we'd likely still have headaches to deal 
with around configuration.
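
(As a hedged sketch of what "hosting timeouts" in Celery might look like, with
invented names and broker URL; per the discussion above, Celery would reach the
broker via Kombu rather than oslo.messaging.)

from celery import Celery

app = Celery('heat_timeouts', broker='amqp://guest:guest@localhost:5672//')

# acks_late means the message is acked only after the task finishes, so a
# worker dying mid-task lets another worker pick the timeout back up.
@app.task(acks_late=True)
def stack_timeout(stack_id):
    # Illustrative: mark the stack operation as failed if it is still running.
    print('stack %s timed out' % stack_id)

# When a stack operation starts, schedule its timeout to fire later.
stack_timeout.apply_async(args=['abc123'], countdown=3600)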



2.Poll database for engine 

Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Ryan Brown
On 11/13/2014 09:58 AM, Clint Byrum wrote:
> Excerpts from Zane Bitter's message of 2014-11-13 05:54:03 -0800:
>> On 13/11/14 03:29, Murugan, Visnusaran wrote:

> [snip]

>>> 3.Migrate heat to use TaskFlow. (Too many code change)
>>
>> If it's just handling timed triggers (maybe this is closer to #2) and 
>> not migrating the whole code base, then I don't see why it would be a 
>> big change (or even a change at all - it's basically new functionality). 
>> I'm not sure if TaskFlow has something like this already. If not we 
>> could also look at what Mistral is doing with timed tasks and see if we 
>> could spin some of it out into an Oslo library.
>>
> 
> I feel like it boils down to something running periodically checking for
> scheduled tasks that are due to run but have not run yet. I wonder if we
> can actually look at Ironic for how they do this, because Ironic polls
> power state of machines constantly, and uses a hash ring to make sure
> only one conductor is polling any one machine at a time. If we broke
> stacks up into a hash ring like that for the purpose of singleton tasks
> like timeout checking, that might work out nicely.

+1

Using a hash ring is a great way to shard tasks. I think the most
sensible way to add this would be to make timeout polling a
responsibility of the Observer instead of the engine.
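To make the hash ring idea concrete, here is a minimal sketch (not Ironic's
actual implementation) of mapping stacks onto engines so that exactly one
engine owns timeout polling for any given stack; the engine and stack IDs
are invented:

import bisect
import hashlib

class HashRing(object):
    """Toy consistent hash ring mapping stack IDs onto engine IDs."""

    def __init__(self, engines, replicas=32):
        self._ring = []
        for engine in engines:
            for i in range(replicas):
                self._ring.append((self._hash('%s-%d' % (engine, i)), engine))
        self._ring.sort()
        self._keys = [key for key, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode('utf-8')).hexdigest(), 16)

    def owner(self, stack_id):
        index = bisect.bisect(self._keys, self._hash(stack_id)) % len(self._ring)
        return self._ring[index][1]

ring = HashRing(['engine-1', 'engine-2', 'engine-3'])
all_stack_ids = ['stack-a', 'stack-b', 'stack-c']   # invented example data
my_engine_id = 'engine-1'
# Each engine polls timeouts only for the stacks the ring assigns to it.
my_stacks = [s for s in all_stack_ids if ring.owner(s) == my_engine_id]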

-- 
Ryan Brown / Software Engineer, Openstack / Red Hat, Inc.



Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Clint Byrum
Excerpts from Joshua Harlow's message of 2014-11-13 00:45:07 -0800:
> A question;
> 
> How is using something like celery in heat vs taskflow in heat (or at least 
> concept [1]) 'too many code change'.
> 
> Both seem like change of similar levels ;-)
> 

I've tried a few times to dive into refactoring some things to use
TaskFlow at a shallow level, and have always gotten confused and
frustrated.

The number of lines changed would probably be about the same. But the
massive shift in thinking is not an easy one to make. It may be worth
thinking about providing a shorter bridge to TaskFlow adoption, because I'm
a huge fan of the idea and would _start_ something with it in a heartbeat,
but refactoring things to use it feels really weird to me.

> What was your metric for determining the code change either would have (out 
> of curiosity)?
> 
> Perhaps u should look at [2], although I'm unclear on what the desired 
> functionality is here.
> 
> Do u want the single engine to transfer its work to another engine when it 
> 'goes down'? If so then the jobboard model + zookeeper inherently does this.
> 
> Or maybe u want something else? I'm probably confused because u seem to be 
> asking for resource timeouts + recover from engine failure (which seems like 
> a liveness issue and not a resource timeout one), those 2 things seem 
> separable.
> 

I agree with you on this. It is definitely a liveness problem. The
resource timeout isn't something I've seen discussed before. We do have
a stack timeout, and we need to keep on honoring that, but we can do
that with a job that sleeps for the stack timeout if we have a liveness
guarantee that will resurrect the job (with the sleep shortened by the
time since stack-update-time) somewhere else if the original engine
can't complete the job.
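Sketched out, that looks something like the following; the Stack fields and
check_stack_timed_out() are invented for illustration, and the interesting
bit is just recomputing the sleep from stack-update-time when the job is
resurrected elsewhere:

import collections
import time

Stack = collections.namedtuple('Stack', 'id updated_at timeout_secs')

def check_stack_timed_out(stack):
    # Invented stand-in: fail the stack operation if it is still IN_PROGRESS.
    print('checking whether stack %s has timed out' % stack.id)

def run_stack_timeout_job(stack):
    # Sleep only for whatever is left of the stack timeout, so a job picked up
    # by a different engine does not restart the clock from zero.
    remaining = stack.timeout_secs - (time.time() - stack.updated_at)
    if remaining > 0:
        time.sleep(remaining)
    check_stack_timed_out(stack)

# e.g. a job resurrected 30 seconds into a 60 second stack timeout:
run_stack_timeout_job(Stack(id='abc123', updated_at=time.time() - 30, timeout_secs=60))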

> [1] http://docs.openstack.org/developer/taskflow/jobs.html
> 
> [2] 
> http://docs.openstack.org/developer/taskflow/examples.html#jobboard-producer-consumer-simple
> 
> On Nov 13, 2014, at 12:29 AM, Murugan, Visnusaran  
> wrote:
> 
> > Hi all,
> >  
> > Convergence-POC distributes stack operations by sending resource actions 
> > over RPC for any heat-engine to execute. Entire stack lifecycle will be 
> > controlled by worker/observer notifications. This distributed model has its 
> > own advantages and disadvantages.
> >  
> > Any stack operation has a timeout and a single engine will be responsible 
> > for it. If that engine goes down, timeout is lost along with it. So a 
> > traditional way is for other engines to recreate timeout from scratch. Also 
> > a missed resource action notification will be detected only when stack 
> > operation timeout happens.
> >  
> > To overcome this, we will need the following capability:
> > 1.   Resource timeout (can be used for retry)
> > 2.   Recover from engine failure (loss of stack timeout, resource 
> > action notification)
> >  
> >  
> > Suggestion:
> > 1.   Use task queue like celery to host timeouts for both stack and 
> > resource.
> > 2.   Poll database for engine failures and restart timers/ retrigger 
> > resource retry (IMHO: This would be a traditional and weighs heavy)
> > 3.   Migrate heat to use TaskFlow. (Too many code change)
> >  
> > I am not suggesting we use Task Flow. Using celery will have very minimum 
> > code change. (decorate appropriate functions)
> >  
> >  
> > Your thoughts.
> >  
> > -Vishnu
> > IRC: ckmvishnu


Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Jastrzebski, Michal
By observer I mean the process which will actually notify about stack timeouts. 
Maybe it was a poor choice of words. Anyway, something will need to check which 
stacks have timed out, and that's a new single point of failure.

> -Original Message-
> From: Zane Bitter [mailto:zbit...@redhat.com]
> Sent: Thursday, November 13, 2014 3:49 PM
> To: openstack-dev@lists.openstack.org
> Subject: Re: [openstack-dev] [Heat] Using Job Queues for timeout ops
> 
> On 13/11/14 09:31, Jastrzebski, Michal wrote:
> > Guys, I don't think we want to get into this cluster management mud.
> > You say let's make observer...and what if observer dies? Do we do
> > observer to observer? And then there is split brain. I'm observer, I've lost
> connection to worker. Should I restart a worker?
> > Maybe I'm one who lost connection to the rest of the world? Should I
> > resume task and risk duplicate workload?
> 
> I think you're misinterpreting what we mean by "observer". See
> https://wiki.openstack.org/wiki/Heat/ConvergenceDesign
> 
> - ZB
> 


Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Clint Byrum
Excerpts from Zane Bitter's message of 2014-11-13 05:54:03 -0800:
> On 13/11/14 03:29, Murugan, Visnusaran wrote:
> > Hi all,
> >
> > Convergence-POC distributes stack operations by sending resource actions
> > over RPC for any heat-engine to execute. Entire stack lifecycle will be
> > controlled by worker/observer notifications. This distributed model has
> > its own advantages and disadvantages.
> >
> > Any stack operation has a timeout and a single engine will be
> > responsible for it. If that engine goes down, timeout is lost along with
> > it. So a traditional way is for other engines to recreate timeout from
> > scratch. Also a missed resource action notification will be detected
> > only when stack operation timeout happens.
> >
> > To overcome this, we will need the following capability:
> >
> > 1.Resource timeout (can be used for retry)
> 
> I don't believe this is strictly needed for phase 1 (essentially we 
> don't have it now, so nothing gets worse).
> 

We do have a stack timeout, and it stands to reason that we won't have a
single box with a timeout greenthread after this, so a strategy is
needed.

> For phase 2, yes, we'll want it. One thing we haven't discussed much is 
> that if we used Zaqar for this then the observer could claim a message 
> but not acknowledge it until it had processed it, so we could have 
> guaranteed delivery.
>

Frankly, if oslo.messaging doesn't support reliable delivery then we
need to add it. Zaqar should have nothing to do with this and is, IMO, a
poor choice at this stage, though I like the idea of using it in the
future so that we can make Heat more of an outside-the-cloud app.

> > 2.Recover from engine failure (loss of stack timeout, resource action
> > notification)
> >
> > Suggestion:
> >
> > 1.Use task queue like celery to host timeouts for both stack and resource.
> 
> I believe Celery is more or less a non-starter as an OpenStack 
> dependency because it uses Kombu directly to talk to the queue, vs. 
> oslo.messaging which is an abstraction layer over Kombu, Qpid, ZeroMQ 
> and maybe others in the future. i.e. requiring Celery means that some 
> users would be forced to install Rabbit for the first time.
>
> One option would be to fork Celery and replace Kombu with oslo.messaging 
> as its abstraction layer. Good luck getting that maintained though, 
> since Celery _invented_ Kombu to be its abstraction layer. 
> 

A slight side point here: Kombu supports Qpid and ZeroMQ. Oslo.messaging
is more about having a unified API than a set of magic backends. It
actually boggles my mind why we didn't just use kombu (cue 20 reactions
with people saying it wasn't EXACTLY right), but I think we're committed
to oslo.messaging now. Anyway, celery would need no such refactor, as
kombu would be able to access the same bus as everything else just fine.

> > 2.Poll database for engine failures and restart timers/ retrigger
> > resource retry (IMHO: This would be a traditional and weighs heavy)
> >
> > 3.Migrate heat to use TaskFlow. (Too many code change)
> 
> If it's just handling timed triggers (maybe this is closer to #2) and 
> not migrating the whole code base, then I don't see why it would be a 
> big change (or even a change at all - it's basically new functionality). 
> I'm not sure if TaskFlow has something like this already. If not we 
> could also look at what Mistral is doing with timed tasks and see if we 
> could spin some of it out into an Oslo library.
> 

I feel like it boils down to something running periodically checking for
scheduled tasks that are due to run but have not run yet. I wonder if we
can actually look at Ironic for how they do this, because Ironic polls
power state of machines constantly, and uses a hash ring to make sure
only one conductor is polling any one machine at a time. If we broke
stacks up into a hash ring like that for the purpose of singleton tasks
like timeout checking, that might work out nicely.
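The "due but not yet run" check itself is dead simple; a sketch using an
in-memory sqlite table (the table and column names are invented, and a real
version would go through sqlalchemy and need to guard against two engines
firing the same task):

import sqlite3
import time

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE scheduled_task (id TEXT, due_at REAL, done INTEGER)')
db.execute("INSERT INTO scheduled_task VALUES ('stack-timeout-abc123', ?, 0)",
           (time.time() - 1,))   # already due

def run_task(task_id):
    # Invented stand-in for firing the timeout / retriggering the resource retry.
    print('running %s' % task_id)

def poll_once(conn):
    # Find tasks that are due but have not run yet, run them, mark them done.
    rows = conn.execute(
        'SELECT id FROM scheduled_task WHERE done = 0 AND due_at <= ?',
        (time.time(),)).fetchall()
    for (task_id,) in rows:
        run_task(task_id)
        conn.execute('UPDATE scheduled_task SET done = 1 WHERE id = ?', (task_id,))

poll_once(db)   # in practice this would run on a periodic timer per engine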



Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Zane Bitter

On 13/11/14 09:31, Jastrzebski, Michal wrote:

Guys, I don't think we want to get into this cluster management mud. You say 
let's
make observer...and what if observer dies? Do we do observer to observer? And 
then
there is split brain. I'm observer, I've lost connection to worker. Should I 
restart a worker?
Maybe I'm one who lost connection to the rest of the world? Should I resume 
task and risk
duplicate workload?


I think you're misinterpreting what we mean by "observer". See 
https://wiki.openstack.org/wiki/Heat/ConvergenceDesign


- ZB



Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Jastrzebski, Michal
Guys, I don't think we want to get into this cluster management mud. You say 
let's
make observer...and what if observer dies? Do we do observer to observer? And 
then
there is split brain. I'm observer, I've lost connection to worker. Should I 
restart a worker?
Maybe I'm one who lost connection to the rest of the world? Should I resume 
task and risk
duplicate workload?

And then there is another problem. If a timeout is caused by workers running 
out of resources, and we restart the whole workload after the timeout, we will 
stretch those resources even further and in turn get more timeouts (...) <- a 
great way to kill the whole setup.

So we get to horizontal scalability, or the total lack of it. Any stack that is 
too complicated for a single engine to process will be impossible to process at 
all. We should find a way to distribute workloads in an active-active, 
stateless (as much as possible) manner.

Regards,
Michał "inc0" Jastrzębski   

> -Original Message-
> From: Murugan, Visnusaran [mailto:visnusaran.muru...@hp.com]
> Sent: Thursday, November 13, 2014 2:59 PM
> To: OpenStack Development Mailing List (not for usage questions)
> Subject: Re: [openstack-dev] [Heat] Using Job Queues for timeout ops
> 
> Zane,
> 
> We do follow shardy's suggestion of having worker/observer as eventlet in
> heat-engine. No new process. The timer will be executed under an engine's
> worker.
> 
> Question:
> 1. heat-engine processing resource-action failed (process killed) 2. heat-
> engine processing timeout for a stack fails (process killed)
> 
> In the above mentioned cases, I thought celery tasks would come to our
> rescue.
> 
> Convergence-poc implementation can recover from error and retry if there is
> a notification available.
> 
> 
> -Vishnu
> 
> -Original Message-
> From: Zane Bitter [mailto:zbit...@redhat.com]
> Sent: Thursday, November 13, 2014 7:05 PM
> To: openstack-dev@lists.openstack.org
> Subject: Re: [openstack-dev] [Heat] Using Job Queues for timeout ops
> 
> On 13/11/14 06:52, Angus Salkeld wrote:
> > On Thu, Nov 13, 2014 at 6:29 PM, Murugan, Visnusaran
> > mailto:visnusaran.muru...@hp.com>>
> wrote:
> >
> > Hi all,
> >
> >
> > Convergence-POC distributes stack operations by sending resource
> > actions over RPC for any heat-engine to execute. Entire stack
> > lifecycle will be controlled by worker/observer notifications. This
> > distributed model has its own advantages and disadvantages.
> >
> >
> > Any stack operation has a timeout and a single engine will be
> > responsible for it. If that engine goes down, timeout is lost along
> > with it. So a traditional way is for other engines to recreate
> > timeout from scratch. Also a missed resource action notification
> > will be detected only when stack operation timeout happens.
> >
> > To overcome this, we will need the following capability:
> >
> > 1. Resource timeout (can be used for retry)
> >
> > We will shortly have a worker job, can't we have a job that just
> > sleeps that gets started in parallel with the job that is doing the work?
> > It gets to the end of the sleep and runs a check.
> 
> What if that worker dies too? There's no guarantee that it'd even be a
> different worker. In fact, there's not even a guarantee that we'd have
> multiple workers.
> 
> BTW Steve Hardy's suggestion, which I have more or less come around to, is
> that the engines themselves should be the workers in convergence, to save
> operators deploying two types of processes. (The observers will still be a
> separate process though, in phase 2.)
> 
> > 
> >
> > 2. Recover from engine failure (loss of stack timeout, resource
> > action notification)
> >
> >
> > My suggestion above could catch failures as long as it was run in a
> > different process.
> >
> > -Angus
> >
> > Suggestion:
> >
> > 1. Use task queue like celery to host timeouts for both stack and
> > resource.
> >
> > 2. Poll database for engine failures and restart timers/
> > retrigger resource retry (IMHO: This would be a traditional and
> > weighs heavy)
> >
> > 3. Migrate heat to use TaskFlow. (Too many code change)
> >
> > I am not suggesting we use Task Flow. Using celery will have very
> > minimum code change. (decorate appropriate functions)

Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Murugan, Visnusaran
Zane,

We do follow shardy's suggestion of having worker/observer as eventlet in 
heat-engine. No new process. The timer will be executed under an engine's 
worker.

Question:
1. heat-engine processing resource-action failed (process killed)
2. heat-engine processing timeout for a stack fails (process killed)

In the above mentioned cases, I thought celery tasks would come to our rescue.

Convergence-poc implementation can recover from error and retry if there is a 
notification available.


-Vishnu

-Original Message-
From: Zane Bitter [mailto:zbit...@redhat.com] 
Sent: Thursday, November 13, 2014 7:05 PM
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

On 13/11/14 06:52, Angus Salkeld wrote:
> On Thu, Nov 13, 2014 at 6:29 PM, Murugan, Visnusaran 
> mailto:visnusaran.muru...@hp.com>> wrote:
>
> Hi all,
>
>
> Convergence-POC distributes stack operations by sending resource
> actions over RPC for any heat-engine to execute. Entire stack
> lifecycle will be controlled by worker/observer notifications. This
> distributed model has its own advantages and disadvantages.
>
>
> Any stack operation has a timeout and a single engine will be
> responsible for it. If that engine goes down, timeout is lost along
> with it. So a traditional way is for other engines to recreate
> timeout from scratch. Also a missed resource action notification
> will be detected only when stack operation timeout happens.
>
> To overcome this, we will need the following capability:
>
> 1. Resource timeout (can be used for retry)
>
> We will shortly have a worker job, can't we have a job that just 
> sleeps that gets started in parallel with the job that is doing the work?
> It gets to the end of the sleep and runs a check.

What if that worker dies too? There's no guarantee that it'd even be a 
different worker. In fact, there's not even a guarantee that we'd have multiple 
workers.

BTW Steve Hardy's suggestion, which I have more or less come around to, is that 
the engines themselves should be the workers in convergence, to save operators 
deploying two types of processes. (The observers will still be a separate 
process though, in phase 2.)

> 
>
> 2. Recover from engine failure (loss of stack timeout, resource
> action notification)
>
>
> My suggestion above could catch failures as long as it was run in a 
> different process.
>
> -Angus
>
> Suggestion:
>
> 1. Use task queue like celery to host timeouts for both stack and
> resource.
>
> 2. Poll database for engine failures and restart timers/
> retrigger resource retry (IMHO: This would be a traditional and
> weighs heavy)
>
> 3. Migrate heat to use TaskFlow. (Too many code change)
>
> I am not suggesting we use Task Flow. Using celery will have very
> minimum code change. (decorate appropriate functions) 
>
>
> Your thoughts.
>
>
> -Vishnu
>
> IRC: ckmvishnu
>
>


Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Zane Bitter

On 13/11/14 03:29, Murugan, Visnusaran wrote:

Hi all,

Convergence-POC distributes stack operations by sending resource actions
over RPC for any heat-engine to execute. Entire stack lifecycle will be
controlled by worker/observer notifications. This distributed model has
its own advantages and disadvantages.

Any stack operation has a timeout and a single engine will be
responsible for it. If that engine goes down, timeout is lost along with
it. So a traditional way is for other engines to recreate timeout from
scratch. Also a missed resource action notification will be detected
only when stack operation timeout happens.

To overcome this, we will need the following capability:

1.Resource timeout (can be used for retry)


I don't believe this is strictly needed for phase 1 (essentially we 
don't have it now, so nothing gets worse).


For phase 2, yes, we'll want it. One thing we haven't discussed much is 
that if we used Zaqar for this then the observer could claim a message 
but not acknowledge it until it had processed it, so we could have 
guaranteed delivery.



2.Recover from engine failure (loss of stack timeout, resource action
notification)

Suggestion:

1.Use task queue like celery to host timeouts for both stack and resource.


I believe Celery is more or less a non-starter as an OpenStack 
dependency because it uses Kombu directly to talk to the queue, vs. 
oslo.messaging which is an abstraction layer over Kombu, Qpid, ZeroMQ 
and maybe others in the future. i.e. requiring Celery means that some 
users would be forced to install Rabbit for the first time.


One option would be to fork Celery and replace Kombu with oslo.messaging 
as its abstraction layer. Good luck getting that maintained though, 
since Celery _invented_ Kombu to be its abstraction layer. 



2.Poll database for engine failures and restart timers/ retrigger
resource retry (IMHO: This would be a traditional and weighs heavy)

3.Migrate heat to use TaskFlow. (Too many code change)


If it's just handling timed triggers (maybe this is closer to #2) and 
not migrating the whole code base, then I don't see why it would be a 
big change (or even a change at all - it's basically new functionality). 
I'm not sure if TaskFlow has something like this already. If not we 
could also look at what Mistral is doing with timed tasks and see if we 
could spin some of it out into an Oslo library.


cheers,
Zane.


I am not suggesting we use Task Flow. Using celery will have very
minimum code change. (decorate appropriate functions)

Your thoughts.

-Vishnu

IRC: ckmvishnu





Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Zane Bitter

On 13/11/14 06:52, Angus Salkeld wrote:

On Thu, Nov 13, 2014 at 6:29 PM, Murugan, Visnusaran
mailto:visnusaran.muru...@hp.com>> wrote:

Hi all,


Convergence-POC distributes stack operations by sending resource
actions over RPC for any heat-engine to execute. Entire stack
lifecycle will be controlled by worker/observer notifications. This
distributed model has its own advantages and disadvantages.


Any stack operation has a timeout and a single engine will be
responsible for it. If that engine goes down, timeout is lost along
with it. So a traditional way is for other engines to recreate
timeout from scratch. Also a missed resource action notification
will be detected only when stack operation timeout happens.

To overcome this, we will need the following capability:

1. Resource timeout (can be used for retry)

We will shortly have a worker job, can't we have a job that just sleeps
that gets started in parallel with the job that is doing the work?
It gets to the end of the sleep and runs a check.


What if that worker dies too? There's no guarantee that it'd even be a 
different worker. In fact, there's not even a guarantee that we'd have 
multiple workers.


BTW Steve Hardy's suggestion, which I have more or less come around to, 
is that the engines themselves should be the workers in convergence, to 
save operators deploying two types of processes. (The observers will 
still be a separate process though, in phase 2.)





2. Recover from engine failure (loss of stack timeout, resource
action notification)


My suggestion above could catch failures as long as it was run in a
different process.

-Angus

Suggestion:

1. Use task queue like celery to host timeouts for both stack and
resource.

2. Poll database for engine failures and restart timers/
retrigger resource retry (IMHO: This would be a traditional and
weighs heavy)

3. Migrate heat to use TaskFlow. (Too many code change)

I am not suggesting we use Task Flow. Using celery will have very
minimum code change. (decorate appropriate functions) 


Your thoughts.


-Vishnu

IRC: ckmvishnu




Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Murugan, Visnusaran
A parallel worker was what I initially thought of. But what do we do if the 
engine hosting that worker goes down?

-Vishnu

From: Angus Salkeld [mailto:asalk...@mirantis.com]
Sent: Thursday, November 13, 2014 5:22 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

On Thu, Nov 13, 2014 at 6:29 PM, Murugan, Visnusaran 
mailto:visnusaran.muru...@hp.com>> wrote:
Hi all,

Convergence-POC distributes stack operations by sending resource actions over 
RPC for any heat-engine to execute. Entire stack lifecycle will be controlled 
by worker/observer notifications. This distributed model has its own advantages 
and disadvantages.

Any stack operation has a timeout and a single engine will be responsible for 
it. If that engine goes down, timeout is lost along with it. So a traditional 
way is for other engines to recreate timeout from scratch. Also a missed 
resource action notification will be detected only when stack operation timeout 
happens.

To overcome this, we will need the following capability:

1.   Resource timeout (can be used for retry)
We will shortly have a worker job, can't we have a job that just sleeps that 
gets started in parallel with the job that is doing the work?
It gets to the end of the sleep and runs a check.

2.   Recover from engine failure (loss of stack timeout, resource action 
notification)


My suggestion above could catch failures as long as it was run in a different 
process.
-Angus


Suggestion:

1.   Use task queue like celery to host timeouts for both stack and 
resource.

2.   Poll database for engine failures and restart timers/ retrigger 
resource retry (IMHO: This would be a traditional and weighs heavy)

3.   Migrate heat to use TaskFlow. (Too many code change)

I am not suggesting we use Task Flow. Using celery will have very minimum code 
change. (decorate appropriate functions)


Your thoughts.

-Vishnu
IRC: ckmvishnu



Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Angus Salkeld
On Thu, Nov 13, 2014 at 6:29 PM, Murugan, Visnusaran <
visnusaran.muru...@hp.com> wrote:

>  Hi all,
>
>
>
> Convergence-POC distributes stack operations by sending resource actions
> over RPC for any heat-engine to execute. Entire stack lifecycle will be
> controlled by worker/observer notifications. This distributed model has its
> own advantages and disadvantages.
>
>
>
> Any stack operation has a timeout and a single engine will be responsible
> for it. If that engine goes down, timeout is lost along with it. So a
> traditional way is for other engines to recreate timeout from scratch. Also
> a missed resource action notification will be detected only when stack
> operation timeout happens.
>
>
>
> To overcome this, we will need the following capability:
>
> 1.   Resource timeout (can be used for retry)
>
We will shortly have a worker job, can't we have a job that just sleeps
that gets started in parallel with the job that is doing the work?
It gets to the end of the sleep and runs a check.
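Roughly like this, as a sketch only (do_stack_work and check_stack_timed_out
are invented names; within a single engine the "jobs" here are just eventlet
greenthreads):

import eventlet

def do_stack_work(stack_id):
    # Invented stand-in for the actual convergence work.
    print('working on stack %s' % stack_id)

def check_stack_timed_out(stack_id):
    # Invented stand-in: fail the operation if it is still running.
    print('timeout check for stack %s' % stack_id)

# Kick off the real work and, in parallel, a job that just sleeps for the
# stack timeout and then runs the check.
worker = eventlet.spawn(do_stack_work, 'abc123')
eventlet.spawn_after(3600, check_stack_timed_out, 'abc123')
worker.wait()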

>  2.   Recover from engine failure (loss of stack timeout, resource
> action notification)
>
>
>

My suggestion above could catch failures as long as it was run in a
different process.

-Angus


>
>
> Suggestion:
>
> 1.   Use task queue like celery to host timeouts for both stack and
> resource.
>
> 2.   Poll database for engine failures and restart timers/ retrigger
> resource retry (IMHO: This would be a traditional and weighs heavy)
>
> 3.   Migrate heat to use TaskFlow. (Too many code change)
>
>
>
> I am not suggesting we use Task Flow. Using celery will have very minimum
> code change. (decorate appropriate functions)
>
>
>
>
>
> Your thoughts.
>
>
>
> -Vishnu
>
> IRC: ckmvishnu
>


Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Murugan, Visnusaran
Hi,

The intention is not to transfer the workload of a failed engine onto an active 
one. The convergence implementation that we are working on will be able to 
recover from a failure, provided a timeout notification hits a heat-engine. All 
I want is a safe holding area for my timeout tasks. A timeout can be a stack 
timeout or a resource timeout.

By code change :) I meant that posting to a job queue will be a matter of 
decorating the timeout method and firing it for delayed execution. I felt that 
we need not use TaskFlow just for posting a delayed execution (a timer, in our 
case).

Correct me if I'm wrong.

-Vishnu

From: Joshua Harlow [mailto:harlo...@outlook.com]
Sent: Thursday, November 13, 2014 2:15 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

A question;

How is using something like celery in heat vs taskflow in heat (or at least 
concept [1]) 'too many code change'.

Both seem like change of similar levels ;-)

What was your metric for determining the code change either would have (out of 
curiosity)?

Perhaps u should look at [2], although I'm unclear on what the desired 
functionality is here.

Do u want the single engine to transfer its work to another engine when it 
'goes down'? If so then the jobboard model + zookeeper inherently does this.

Or maybe u want something else? I'm probably confused because u seem to be 
asking for resource timeouts + recover from engine failure (which seems like a 
liveness issue and not a resource timeout one), those 2 things seem separable.

[1] http://docs.openstack.org/developer/taskflow/jobs.html

[2] 
http://docs.openstack.org/developer/taskflow/examples.html#jobboard-producer-consumer-simple

On Nov 13, 2014, at 12:29 AM, Murugan, Visnusaran 
mailto:visnusaran.muru...@hp.com>> wrote:


Hi all,

Convergence-POC distributes stack operations by sending resource actions over 
RPC for any heat-engine to execute. Entire stack lifecycle will be controlled 
by worker/observer notifications. This distributed model has its own advantages 
and disadvantages.

Any stack operation has a timeout and a single engine will be responsible for 
it. If that engine goes down, timeout is lost along with it. So a traditional 
way is for other engines to recreate timeout from scratch. Also a missed 
resource action notification will be detected only when stack operation timeout 
happens.

To overcome this, we will need the following capability:
1.   Resource timeout (can be used for retry)
2.   Recover from engine failure (loss of stack timeout, resource action 
notification)


Suggestion:
1.   Use task queue like celery to host timeouts for both stack and 
resource.
2.   Poll database for engine failures and restart timers/ retrigger 
resource retry (IMHO: This would be a traditional and weighs heavy)
3.   Migrate heat to use TaskFlow. (Too many code change)

I am not suggesting we use Task Flow. Using celery will have very minimum code 
change. (decorate appropriate functions)


Your thoughts.

-Vishnu
IRC: ckmvishnu


Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

2014-11-13 Thread Joshua Harlow
A question;

How is using something like celery in heat vs taskflow in heat (or at least 
concept [1]) 'too many code change'.

Both seem like change of similar levels ;-)

What was your metric for determining the code change either would have (out of 
curiosity)?

Perhaps u should look at [2], although I'm unclear on what the desired 
functionality is here.

Do u want the single engine to transfer its work to another engine when it 
'goes down'? If so then the jobboard model + zookeeper inherently does this.

Or maybe u want something else? I'm probably confused because u seem to be 
asking for resource timeouts + recover from engine failure (which seems like a 
liveness issue and not a resource timeout one), those 2 things seem separable.

[1] http://docs.openstack.org/developer/taskflow/jobs.html

[2] 
http://docs.openstack.org/developer/taskflow/examples.html#jobboard-producer-consumer-simple

On Nov 13, 2014, at 12:29 AM, Murugan, Visnusaran  
wrote:

> Hi all,
>  
> Convergence-POC distributes stack operations by sending resource actions over 
> RPC for any heat-engine to execute. Entire stack lifecycle will be controlled 
> by worker/observer notifications. This distributed model has its own 
> advantages and disadvantages.
>  
> Any stack operation has a timeout and a single engine will be responsible for 
> it. If that engine goes down, timeout is lost along with it. So a traditional 
> way is for other engines to recreate timeout from scratch. Also a missed 
> resource action notification will be detected only when stack operation 
> timeout happens.
>  
> To overcome this, we will need the following capability:
> 1.   Resource timeout (can be used for retry)
> 2.   Recover from engine failure (loss of stack timeout, resource action 
> notification)
>  
>  
> Suggestion:
> 1.   Use task queue like celery to host timeouts for both stack and 
> resource.
> 2.   Poll database for engine failures and restart timers/ retrigger 
> resource retry (IMHO: This would be a traditional and weighs heavy)
> 3.   Migrate heat to use TaskFlow. (Too many code change)
>  
> I am not suggesting we use Task Flow. Using celery will have very minimum 
> code change. (decorate appropriate functions)
>  
>  
> Your thoughts.
>  
> -Vishnu
> IRC: ckmvishnu