Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-11-03 Thread Steven Dake

Sandy,

Apologies for not responding earlier, I am on vacation ATM. Responses 
inline.


On 10/31/2013 08:13 AM, Sandy Walsh wrote:


On 10/30/2013 08:08 PM, Steven Dake wrote:

On 10/30/2013 12:20 PM, Sandy Walsh wrote:

On 10/30/2013 03:10 PM, Steven Dake wrote:

I will -2 any patch that adds zookeeper as a dependency to Heat.

Certainly any distributed locking solution should be plugin based and
optional. Just as a database-oriented solution could be the default
plugin.

Sandy,

Even if it is optional, some percentage of the userbase will enable it
and expect the Heat community to debug and support it.

But, that's the nature of every openstack project. I don't support
HyperV in Nova or HBase in Ceilometer. The implementers deal with that
support. I can help guide someone to those people but have no intentions
of standing up those environments.


The HyperV scenario is different from the Heat scenario.  If someone 
uses HyperV in Nova, they are pretty much on their own because of their 
choice to support an MS-based virt infrastructure.  HyperV is not a 
choice many OpenStack deployments (if any...) will make.


In the case of Heat, the default choice will be zookeeper, because it 
doesn't suffer from the deadlock problem.  Unfortunately these users will 
not recognize the problems that come with zookeeper (HA, scalability, 
security, documentation, a different runtime environment) until it is too 
late to alter the previous decision that was made (We must have 
zookeeper!!).


I really don't think most people will *use* Nova with HyperV even though 
it is optional.  I do believe, however, that if zookeeper were 
optional it would become the default way to deploy Heat. This 
naturally leads to the Heat developers dealing with the ensuing chaos 
for one very specific problem that can be solved using alternative methods.


When users of OpenStack choose a db (postgres or mysql), or amqp (qpid 
or rabbit), the entire OpenStack community is able to respond with 
support, rather than one program (heat) in the hypothetical case of 
Zookeeper.



Re: the Java issue, we already have optional components in other
languages. I know Java is a different league of pain, but if it's an
optional component and left as a choice of the deployer, should we care?

-S

PS As an aside, what are your issues with ZK?



I realize zookeeper exists for a reason.  But unfortunately Zookeeper is
a server, rather than an in-process library.  This means someone needs
to figure out how to document, scale, secure, and provide high
availability for this component.

Yes, that's why we would use it. Same goes for rabbit and mysql.


The pain is not worth the gain.  A better solution would be to avoid 
locking altogether.



This is extremely challenging for the
two server infrastructure components OpenStack server processes depend
on today (AMQP, SQL).  If the entire OpenStack community saw value in
biting the bullet and accepting zookeeper as a dependency and taking on
this work, I might be more amenable.

Why do other services need to agree on adopting ZK? If some Heat users
need it, they can use it. Nova shouldn't care.
And the Heat devs are the only folks with any responsibility to support 
it in this scenario.  If we are going to bring the pain, we should share 
it equally between programs :)



What we are talking about in the
review, however, is that the Heat team bite that bullet, which is a big
addition to the scope of work we already execute for the ability to gain
a distributed lock.  I would expect there are simpler approaches to
solve the problem without dragging the baggage of a new server component
into the OpenStack deployment.

Yes, there probably are, and alternatives are good. But, as others have
attested, ZK is tried and true. Why not support it also?


Choices are fundamental in life, and I generally disagree with limiting 
choices for people, including our target user base. However, I'd prefer 
we investigate choices that do not negatively impact one group of folks 
(the Heat developers and the various downstreams) in a profound way 
before we give up and say it's too hard.


As an example of the thought processes that went on after zookeeper was 
proposed, heat-core devs were talking about introducing a zookeeper 
tempest gate for Heat.  In my mind, if we gate it, we support it.  This 
leads to the natural conclusion that it must be securi-fied, HA-ified, 
documenti-fied, and scalei-fied.  Groan.


See how one optional feature spins out of control?


Using zookeeper as is suggested in the review is far different from the
way Nova uses Zookeeper.  With the Nova use case, Nova still operates
just dandy without zookeeper.  With zookeeper in the Heat use case, it
essentially becomes the default way people are expected to deploy Heat.

Why, if it's a plugin?

explained above.

What I would prefer is taskflow over AMQP, to leverage existing server
infrastructure (that has already been documented, scaled, secured, and
HA-ified).

Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-11-01 Thread Julien Danjou
On Thu, Oct 31 2013, Monty Taylor wrote:

 Sigh.

 Yay We've added more competing methods of complexity!!!

 Seriously. We now think that rabbit and zookeeper and mysql are ALL needed?

Yes, if you have a synchronization problem that Paxos can resolve,
leveraging ZooKeeper is a good idea IMHO.
Depending _always_ on ZooKeeper is maybe not the best call; that's why
I have in mind proposing a library in Oslo providing several
drivers that solve this synchronization issue, where one of the drivers
could be ZK-based.
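
That Oslo proposal is the idea that eventually shipped as the tooz
coordination library. A minimal sketch of its driver-agnostic lock API,
assuming a local ZooKeeper endpoint and made-up member/lock names:

    from tooz import coordination

    # The backend URL selects the driver; swapping "zookeeper://..." for
    # another supported backend leaves the calling code unchanged.
    coordinator = coordination.get_coordinator(
        "zookeeper://127.0.0.1:2181", b"engine-1")
    coordinator.start()

    lock = coordinator.get_lock(b"stack-lock")
    with lock:  # acquired through whichever driver was configured
        pass    # ... critical section ...

    coordinator.stop()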

As for MySQL, rest assured, it's not needed; you can use PostgreSQL.
;-)

-- 
Julien Danjou
# Free Software hacker # independent consultant
# http://julien.danjou.info




Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-31 Thread Joshua Harlow
I looked at it once, but noticed it's piggybacking on DNS infrastructure and
then wondered who has actually used this in production, and didn't find much
info (except that it's a Cornell project).

I never quite did get around to setting it up. Maybe you will have better luck :)

Sent from my really tiny device...

 On Oct 30, 2013, at 12:32 PM, Jay Pipes jaypi...@gmail.com wrote:
 
 Has anyone looked into using concoord for distributed locking?
 
 https://pypi.python.org/pypi/concoord
 
 Best,
 -jay
 
 On 10/30/2013 02:39 PM, Joshua Harlow wrote:
 So my idea here was to break the abstraction for heat into 3 parts.
 
 Pardon my lack of heat terminology/knowledge if I miss something.
 
 1. The thing that receives the API request (I would assume an api server
 here).
 
 I would expect #1 to parse something into a known internal format. Whether
 this is tasks or jobs or something is up to heat, so this might have been my
 lack of understanding heat concepts here, but usually an API request
 translates into some internal format. Maybe this is the parser or
 something else (not sure really).
 
 Let's assume for now that it parses the API request into some tasks + flow
 (what taskflow provides).
 
 So then it becomes a question of what you do with those tasks & flows
 (what I call stage #2).
 
 - https://wiki.openstack.org/wiki/TaskFlow#Two_thousand_foot_view
 
 To me this is where taskflow 'shines' in that it has an engine concept
 which can run in various manners (the tasks and flow are not strongly
 associated with an engine). One of these engines is planned to be a
 distributed one (but it's not the only one), and with that engine type it
 would have to interact with some type of job management system (or it
 would have to provide that job management system - or a simple version
 itself), but the difference is that the knowledge about tasks and flows
 (and the links/structure between them) is still disconnected from the
 actual engine that runs those tasks & flows. So this to me means that
 there is pluggability with regard to execution, which I think is pretty
 great.
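
A minimal sketch of that separation, using the API from later taskflow
releases (the task names here are made up): the flow's structure is
declared once, and the engine that executes it is chosen at run time.

    from taskflow import engines, task
    from taskflow.patterns import linear_flow

    class CreateNetwork(task.Task):
        def execute(self):
            print("creating network")

    class CreateServer(task.Task):
        def execute(self):
            print("creating server")

    # Structure lives in the flow; execution strategy lives in the engine.
    flow = linear_flow.Flow("create-stack").add(CreateNetwork(),
                                                CreateServer())

    # Swapping engine="serial" for a parallel (or, eventually, distributed)
    # engine requires no change to the tasks or the flow.
    engines.run(flow, engine="serial")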
 
 If that requires rework of the heat model, way of running, maybe its for
 the better? Idk.
 
 As taskflow is still newish, and most projects in openstack have their own
 distributed model (conductors, rpc process separation), we wanted to focus
 on having the basic principles down; as for the review
 https://review.openstack.org/#/c/47609/, I am very grateful to Jessica for
 working her hardest to get it to a nearly-there state. So yes, taskflow
 will continue on the path/spirit of 47609, and contributions are welcome
 of course :-)
 
 Feel free to also jump on #openstack-state-management since it might be
 easier to just chat there in the end with other interested parties.
 
 -Josh
 
 On 10/30/13 11:10 AM, Steven Dake sd...@redhat.com wrote:
 
 On 10/30/2013 10:42 AM, Clint Byrum wrote:
 So, recently we've had quite a long thread in gerrit regarding locking
 in Heat:
 
 https://review.openstack.org/#/c/49440/
 
 In the patch, there are two distributed lock drivers. One uses SQL,
 and suffers from all the problems you might imagine a SQL based locking
 system would. It is extremely hard to detect dead lock holders, so we
 end up with really long timeouts. The other is ZooKeeper.
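
To make that failure mode concrete, here is a toy sketch of such a SQL
lock driver (the stack_lock schema is hypothetical): a crashed holder's
row looks identical to a live one, so it can only be reclaimed after a
long, fixed timeout.

    import sqlite3
    import time

    STALE_AFTER = 300.0  # seconds an unreleased lock must age before theft

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE stack_lock (stack_id TEXT PRIMARY KEY,"
               " holder TEXT, acquired_at REAL)")

    def acquire(stack_id, holder):
        now = time.time()
        try:
            db.execute("INSERT INTO stack_lock VALUES (?, ?, ?)",
                       (stack_id, holder, now))
            db.commit()
            return True
        except sqlite3.IntegrityError:
            # The row exists: steal it only if the holder looks dead, i.e.
            # the lock is older than the (necessarily long) timeout.
            cur = db.execute(
                "UPDATE stack_lock SET holder = ?, acquired_at = ?"
                " WHERE stack_id = ? AND acquired_at < ?",
                (holder, now, stack_id, now - STALE_AFTER))
            db.commit()
            return cur.rowcount == 1

    def release(stack_id, holder):
        db.execute("DELETE FROM stack_lock"
                   " WHERE stack_id = ? AND holder = ?", (stack_id, holder))
        db.commit()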
 
 I'm on record as saying we're not using ZooKeeper. It is a little
 embarrassing to have taken such a position without really thinking
 things
 through. The main reason I feel this way though, is not because
 ZooKeeper
 wouldn't work for locking, but because I think locking is a mistake.
 
 The current multi-engine paradigm has a race condition. If you have a
 stack action going on, the state is held in the engine itself, and not
 in the database, so if another engine starts working on another action,
 they will conflict.
 
 The locking paradigm is meant to prevent this. But I think this is a
 huge mistake.
 
 The engine should store _all_ of its state in a distributed data store
 of some kind. Any engine should be aware of what is already happening
 with the stack from this state and act accordingly. That includes the
  engine currently working on actions. When viewed through this lens,
 to me, locking is a poor excuse for serializing the state of the engine
 scheduler.
 
 It feels like TaskFlow is the answer, with an eye for making sure
 TaskFlow can be made to work with distributed state. I am not well
 versed on TaskFlow's details though, so I may be wrong. It worries me
 that TaskFlow has existed a while and doesn't seem to be solving real
 problems, but maybe I'm wrong and it is actually in use already.
 
 Anyway, as a band-aid, we may _have_ to do locking. For that, ZooKeeper
 has some real advantages over using the database. But there is hesitance
 because it is not widely supported in OpenStack. What say you, OpenStack
 community? Should we keep ZooKeeper out of our.. zoo?
 
 I will -2 any patch that adds zookeeper as a dependency to Heat.
 
 The rest of the idea 

Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-31 Thread Joshua Harlow
Lock-free solutions usually involve very predictable and very well understood
(and typically provably correct) state machines. In openstack you have state
machines, but I think there is work to be done to increase their
understandability and predictability.

Lock-free algorithms are also extremely hard to get right (this is why the
above state machine typically undergoes very thought-out and extensive peer
review). In openstack we all are that peer review :)
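
A concrete sketch of that lock-free style (table and states are
hypothetical): each transition of the state machine is a single atomic
compare-and-swap, so no mutex is ever held, and correctness rests entirely
on the transition table being right -- which is exactly where the peer
review burden lands.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE stack (id TEXT PRIMARY KEY, state TEXT)")
    db.execute("INSERT INTO stack VALUES ('s1', 'CREATE_COMPLETE')")

    # The reviewed artifact: which transitions are legal.
    ALLOWED = {
        ("CREATE_COMPLETE", "UPDATE_IN_PROGRESS"),
        ("UPDATE_IN_PROGRESS", "UPDATE_COMPLETE"),
    }

    def advance(stack_id, current, target):
        if (current, target) not in ALLOWED:
            raise ValueError("illegal transition")
        # Atomic compare-and-swap: at most one engine's UPDATE can match
        # the WHERE clause, so losers simply see rowcount == 0.
        cur = db.execute(
            "UPDATE stack SET state = ? WHERE id = ? AND state = ?",
            (target, stack_id, current))
        db.commit()
        return cur.rowcount == 1  # False: another engine won the race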

Hopefully this discussion here (and on irc and on code reviews) can help jump 
start more thinking around this whole area.

Sent from my really tiny device...

 On Oct 30, 2013, at 12:33 PM, Qing He qing...@radisys.com wrote:
 
 Has anyone looked at any  lock-free solution?
 
 -Original Message-
 From: Sandy Walsh [mailto:sandy.wa...@rackspace.com] 
 Sent: Wednesday, October 30, 2013 12:20 PM
 To: openstack-dev@lists.openstack.org
 Subject: Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey
 
 
 
 On 10/30/2013 03:10 PM, Steven Dake wrote:
 I will -2 any patch that adds zookeeper as a dependency to Heat.
 
 Certainly any distributed locking solution should be plugin based and 
 optional. Just as a database-oriented solution could be the default plugin.
 
 Re: the Java issue, we already have optional components in other languages. I 
 know Java is a different league of pain, but if it's an optional component 
 and left as a choice of the deployer, should we care?
 
 -S
 
 PS As an aside, what are your issues with ZK?
 
 



Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-31 Thread Julien Danjou
On Wed, Oct 30 2013, Clint Byrum wrote:

 In the patch, there are two distributed lock drivers. One uses SQL,
 and suffers from all the problems you might imagine a SQL based locking
 system would. It is extremely hard to detect dead lock holders, so we
 end up with really long timeouts. The other is ZooKeeper.

I didn't check the patch extensively nor what it tries to address, so
pardon me if I'm wrong, but I think there might be some overlap with
what we need in Ceilometer. We scheduled a session in Oslo to talk about
this:

http://icehousedesignsummit.sched.org/event/ac126a677c0c8a7960b9ff75ccc8d4d2#.UnIeVHUsJRQ

We want to solve this in an agnostic manner. There also might be some
overlap or complementarity with TaskFlow; I hope we'll discuss this during
this session.

-- 
Julien Danjou
/* Free Software hacker * independent consultant
   http://julien.danjou.info */




Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-31 Thread Renat Akhmerov

On Oct 31, 2013, at 2:37, Clint Byrum cl...@fewbar.com wrote:

 My point
 is really that we should not care how serialization happens, we should
 just express the work-flow, and let the underlying mechanisms distribute
 and manage it as it is completed.


Sounds reasonable.

In this context, you may want to look at the Mistral project we recently started.
We just published a kind of high-level description of the use case that is
relevant for the problem that is being discussed here. It's called Cloud
Environment Deployment and you can find it at
https://wiki.openstack.org/wiki/Mistral/Cloud_Environment_Deployment_details.
We think it can be a very important application for Mistral. Any kind of
locking mechanism, imho, should be avoided at all system levels unless it's
absolutely required for algorithm complexity reasons (when there's no other
way). So if we can represent what Heat does in its internals as a set of
related tasks, we can offload dependency resolution to a system like Mistral
that would do everything in a distributed manner.

Another interesting feature we're planning to implement is data flow. That is,
some state (sort of a context) associated with a workflow travels between nodes
in a task graph, so there's no need to worry about shared state in many cases.
Data transfer is supposed to work on top of some HA transport (like rabbit),
so it shouldn't be a challenge to implement. Not 100% sure
that it's all applicable for solving this Heat task (pardon me), but it
definitely could be considered as a possibility. Unfortunately, we've not
described this feature very well yet (only some pictures and sketches not
published anywhere), but we will soon.

I don’t really want it to look like an ad, sorry if it does ( :) ). It would be 
cool if we could collaborate on this. Once you look at our ideas, you could 
provide your input on what else should be taken into account in Mistral design 
in order to address your problem well.

Renat




Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-31 Thread Monty Taylor


On 10/30/2013 10:42 AM, Clint Byrum wrote:
 So, recently we've had quite a long thread in gerrit regarding locking
 in Heat:
 
 https://review.openstack.org/#/c/49440/
 
 In the patch, there are two distributed lock drivers. One uses SQL,
 and suffers from all the problems you might imagine a SQL based locking
 system would. It is extremely hard to detect dead lock holders, so we
 end up with really long timeouts. The other is ZooKeeper.
 
 I'm on record as saying we're not using ZooKeeper. It is a little
 embarrassing to have taken such a position without really thinking things
 through. The main reason I feel this way though, is not because ZooKeeper
 wouldn't work for locking, but because I think locking is a mistake.
 
 The current multi-engine paradigm has a race condition. If you have a
 stack action going on, the state is held in the engine itself, and not
 in the database, so if another engine starts working on another action,
 they will conflict.
 
 The locking paradigm is meant to prevent this. But I think this is a
 huge mistake.
 
 The engine should store _all_ of its state in a distributed data store
 of some kind. Any engine should be aware of what is already happening
 with the stack from this state and act accordingly. That includes the
 engine currently working on actions. When viewed through this lense,
 engine currently working on actions. When viewed through this lens,
 to me, locking is a poor excuse for serializing the state of the engine
 scheduler.
 
 It feels like TaskFlow is the answer, with an eye for making sure
 TaskFlow can be made to work with distributed state. I am not well
 versed on TaskFlow's details though, so I may be wrong. It worries me
 that TaskFlow has existed a while and doesn't seem to be solving real
 problems, but maybe I'm wrong and it is actually in use already.
 
 Anyway, as a band-aid, we may _have_ to do locking. For that, ZooKeeper
 has some real advantages over using the database. But there is hesitance
 because it is not widely supported in OpenStack. What say you, OpenStack
 community? Should we keep ZooKeeper out of our.. zoo?

Yes. I'm strongly opposed to ZooKeeper finding its way into the already
complex pile of things we use.



Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-31 Thread Sandy Walsh


On 10/30/2013 08:08 PM, Steven Dake wrote:
 On 10/30/2013 12:20 PM, Sandy Walsh wrote:

 On 10/30/2013 03:10 PM, Steven Dake wrote:
 I will -2 any patch that adds zookeeper as a dependency to Heat.
 Certainly any distributed locking solution should be plugin based and
 optional. Just as a database-oriented solution could be the default
 plugin.

 Sandy,
 
 Even if it is optional, some percentage of the userbase will enable it
 and expect the Heat community to debug and support it.

But, that's the nature of every openstack project. I don't support
HyperV in Nova or HBase in Ceilometer. The implementers deal with that
support. I can help guide someone to those people but have no intentions
of standing up those environments.

 Re: the Java issue, we already have optional components in other
 languages. I know Java is a different league of pain, but if it's an
 optional component and left as a choice of the deployer, should we care?

 -S

 PS As an aside, what are your issues with ZK?

 
 
 I realize zookeeper exists for a reason.  But unfortunately Zookeeper is
 a server, rather than an in-process library.  This means someone needs
 to figure out how to document, scale, secure, and provide high
 availability for this component.  

Yes, that's why we would use it. Same goes for rabbit and mysql.

 This is extremely challenging for the
 two server infrastructure components OpenStack server processes depend
 on today (AMQP, SQL).  If the entire OpenStack community saw value in
 biting the bullet and accepting zookeeper as a dependency and taking on
 this work, I might be more amenable.

Why do other services need to agree on adopting ZK? If some Heat users
need it, they can use it. Nova shouldn't care.

 What we are talking about in the
 review, however, is that the Heat team bite that bullet, which is a big
 addition to the scope of work we already execute for the ability to gain
 a distributed lock.  I would expect there are simpler approaches to
 solve the problem without dragging the baggage of a new server component
 into the OpenStack deployment.

Yes, there probably are, and alternatives are good. But, as others have
attested, ZK is tried and true. Why not support it also?

 Using zookeeper as is suggested in the review is far different from the
 way Nova uses Zookeeper.  With the Nova use case, Nova still operates
 just dandy without zookeeper.  With zookeeper in the Heat use case, it
 essentially becomes the default way people are expected to deploy Heat.

Why, if it's a plugin?

 What I would prefer is taskflow over AMQP, to leverage existing server
 infrastructure (that has already been documented, scaled, secured, and
 HA-ified).

Same problem exists, we're just pushing the ZK decision to another service.

 Regards
 -steve
 



Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-31 Thread Sandy Walsh


On 10/31/2013 11:43 AM, Monty Taylor wrote:
 
 Yes. I'm strongly opposed to ZooKeeper finding its way into the already
 complex pile of things we use.

Monty, is that just because the stack is very complicated now, or
something personal against ZK (or Java specifically)?

Curious.

-S


 


Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-31 Thread Joshua Harlow
I'm pretty sure the cat's out of the bag.

https://github.com/openstack/requirements/blob/master/global-requirements.txt#L29

https://kazoo.readthedocs.org/en/latest/
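
For reference, a sketch of the lock recipe kazoo ships (the host and lock
path are assumptions): the lock node is ephemeral, so a crashed holder's
lock disappears when its ZooKeeper session expires -- the dead-holder case
a SQL driver can only approximate with long timeouts.

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # kazoo's Lock recipe creates an ephemeral sequential node under the
    # given path; it vanishes with the holder's session if the holder dies.
    lock = zk.Lock("/heat/locks/some-stack-id", identifier="engine-1")
    with lock:  # blocks until acquired
        pass    # ... perform the stack action ...

    zk.stop()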

-Josh

On 10/31/13 7:43 AM, Monty Taylor mord...@inaugust.com wrote:



On 10/30/2013 10:42 AM, Clint Byrum wrote:
 So, recently we've had quite a long thread in gerrit regarding locking
 in Heat:
 
 https://review.openstack.org/#/c/49440/
 
 In the patch, there are two distributed lock drivers. One uses SQL,
 and suffers from all the problems you might imagine a SQL based locking
 system would. It is extremely hard to detect dead lock holders, so we
 end up with really long timeouts. The other is ZooKeeper.
 
 I'm on record as saying we're not using ZooKeeper. It is a little
 embarrassing to have taken such a position without really thinking
things
 through. The main reason I feel this way though, is not because
ZooKeeper
 wouldn't work for locking, but because I think locking is a mistake.
 
 The current multi-engine paradigm has a race condition. If you have a
 stack action going on, the state is held in the engine itself, and not
 in the database, so if another engine starts working on another action,
 they will conflict.
 
 The locking paradigm is meant to prevent this. But I think this is a
 huge mistake.
 
 The engine should store _all_ of its state in a distributed data store
 of some kind. Any engine should be aware of what is already happening
 with the stack from this state and act accordingly. That includes the
 engine currently working on actions. When viewed through this lens,
 to me, locking is a poor excuse for serializing the state of the engine
 scheduler.
 
 It feels like TaskFlow is the answer, with an eye for making sure
 TaskFlow can be made to work with distributed state. I am not well
 versed on TaskFlow's details though, so I may be wrong. It worries me
 that TaskFlow has existed a while and doesn't seem to be solving real
 problems, but maybe I'm wrong and it is actually in use already.
 
 Anyway, as a band-aid, we may _have_ to do locking. For that, ZooKeeper
 has some real advantages over using the database. But there is hesitance
 because it is not widely supported in OpenStack. What say you, OpenStack
 community? Should we keep ZooKeeper out of our.. zoo?

Yes. I'm strongly opposed to ZooKeeper finding its way into the already
complex pile of things we use.



Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-31 Thread Monty Taylor
Sigh.

Yay We've added more competing methods of complexity!!!

Seriously. We now think that rabbit and zookeeper and mysql are ALL needed?

Joshua Harlow harlo...@yahoo-inc.com wrote:

I'm pretty sure the cat's out of the bag.

https://github.com/openstack/requirements/blob/master/global-requirements.txt#L29

https://kazoo.readthedocs.org/en/latest/

-Josh

On 10/31/13 7:43 AM, Monty Taylor mord...@inaugust.com wrote:



On 10/30/2013 10:42 AM, Clint Byrum wrote:
 So, recently we've had quite a long thread in gerrit regarding locking
 in Heat:
 
 https://review.openstack.org/#/c/49440/
 
 In the patch, there are two distributed lock drivers. One uses SQL,
 and suffers from all the problems you might imagine a SQL based locking
 system would. It is extremely hard to detect dead lock holders, so we
 end up with really long timeouts. The other is ZooKeeper.
 
 I'm on record as saying we're not using ZooKeeper. It is a little
 embarrassing to have taken such a position without really thinking
things
 through. The main reason I feel this way though, is not because
ZooKeeper
 wouldn't work for locking, but because I think locking is a mistake.
 
 The current multi-engine paradigm has a race condition. If you have a
 stack action going on, the state is held in the engine itself, and not
 in the database, so if another engine starts working on another action,
 they will conflict.
 
 The locking paradigm is meant to prevent this. But I think this is a
 huge mistake.
 
 The engine should store _all_ of its state in a distributed data store
 of some kind. Any engine should be aware of what is already happening
 with the stack from this state and act accordingly. That includes the
 engine currently working on actions. When viewed through this lens,
 to me, locking is a poor excuse for serializing the state of the engine
 scheduler.
 
 It feels like TaskFlow is the answer, with an eye for making sure
 TaskFlow can be made to work with distributed state. I am not well
 versed on TaskFlow's details though, so I may be wrong. It worries me
 that TaskFlow has existed a while and doesn't seem to be solving real
 problems, but maybe I'm wrong and it is actually in use already.
 
 Anyway, as a band-aid, we may _have_ to do locking. For that, ZooKeeper
 has some real advantages over using the database. But there is hesitance
 because it is not widely supported in OpenStack. What say you, OpenStack
 community? Should we keep ZooKeeper out of our.. zoo?

Yes. I'm strongly opposed to ZooKeeper finding its way into the already
complex pile of things we use.



Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-31 Thread Joshua Harlow
In the spirit of openness, yes I do think they are all needed.

If they are not supported, then openstack is not open, it is a closed
system.

We should strive to innovate, not strive to be stuck with the status quo.

To me it is a developer's decision to pick the right solution; if that
solution involves some complexity then you make it a pluggable solution.
Your view of the right solution will likely not be someone else's view of
the right solution in the end anyway (and you likely can't predict future
solutions that will be applicable anyway). If you just say no to that plugin
then you are just excluding people from participating in your project, and
this is imho against the spirit of openness in general. And those people
who would have contributed will just start looking elsewhere for a
solution which does work. This kills the openstack...

On 10/31/13 10:17 AM, Monty Taylor mord...@inaugust.com wrote:

Sigh.

Yay We've added more competing methods of complexity!!!

Seriously. We now think that rabbit and zookeeper and mysql are ALL
needed?

Joshua Harlow harlo...@yahoo-inc.com wrote:

I'm pretty sure the cat's out of the bag.

https://github.com/openstack/requirements/blob/master/global-requirements.txt#L29

https://kazoo.readthedocs.org/en/latest/

-Josh

On 10/31/13 7:43 AM, Monty Taylor mord...@inaugust.com wrote:



On 10/30/2013 10:42 AM, Clint Byrum wrote:
 So, recently we've had quite a long thread in gerrit regarding locking
 in Heat:
 
 https://review.openstack.org/#/c/49440/
 
 In the patch, there are two distributed lock drivers. One uses SQL,
 and suffers from all the problems you might imagine a SQL based
locking
 system would. It is extremely hard to detect dead lock holders, so we
 end up with really long timeouts. The other is ZooKeeper.
 
 I'm on record as saying we're not using ZooKeeper. It is a little
 embarrassing to have taken such a position without really thinking
things
 through. The main reason I feel this way though, is not because
ZooKeeper
 wouldn't work for locking, but because I think locking is a mistake.
 
 The current multi-engine paradigm has a race condition. If you have a
 stack action going on, the state is held in the engine itself, and not
 in the database, so if another engine starts working on another
action,
 they will conflict.
 
 The locking paradigm is meant to prevent this. But I think this is a
 huge mistake.
 
 The engine should store _all_ of its state in a distributed data store
 of some kind. Any engine should be aware of what is already happening
 with the stack from this state and act accordingly. That includes the
 engine currently working on actions. When viewed through this lens,
 to me, locking is a poor excuse for serializing the state of the
engine
 scheduler.
 
 It feels like TaskFlow is the answer, with an eye for making sure
 TaskFlow can be made to work with distributed state. I am not well
 versed on TaskFlow's details though, so I may be wrong. It worries me
 that TaskFlow has existed a while and doesn't seem to be solving real
 problems, but maybe I'm wrong and it is actually in use already.
 
 Anyway, as a band-aid, we may _have_ to do locking. For that,
ZooKeeper
 has some real advantages over using the database. But there is
hesitance
 because it is not widely supported in OpenStack. What say you,
OpenStack
 community? Should we keep ZooKeeper out of our.. zoo?

Yes. I'm strongly opposed to ZooKeeper finding its way into the already
complex pile of things we use.



Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-31 Thread Monty Taylor


On 10/31/2013 01:32 PM, Joshua Harlow wrote:
 In the spirit of openness, yes I do think they are all needed.
 
 If they are not supported, then openstack is not open, it is a closed
 system.
 
 We should strive to innovate, not strive to be stuck with the status quo.
 
 To me it is a developer's decision to pick the right solution; if that
 solution involves some complexity then you make it a pluggable solution.
 Your view of the right solution will likely not be someone else's view of
 the right solution in the end anyway (and you likely can't predict future
 solutions that will be applicable anyway). If you just say no to that plugin
 then you are just excluding people from participating in your project, and
 this is imho against the spirit of openness in general. And those people
 who would have contributed will just start looking elsewhere for a
 solution which does work. This kills the openstack...

Hrm. I certainly don't want to kill openstack, and I _certainly_ don't
want to disallow a plugin. I apologize if it came across that way.

What I was questioning is whether or not we wanted to add a hard
requirement on zookeeper somewhere. Even Rabbit and MySQL are decisions
with other options. It's entirely possible I misread the conversation of
course...

 On 10/31/13 10:17 AM, Monty Taylor mord...@inaugust.com wrote:
 
 Sigh.

 Yay We've added more competing methods of complexity!!!

 Seriously. We now think that rabbit and zookeeper and mysql are ALL
 needed?

 Joshua Harlow harlo...@yahoo-inc.com wrote:

 I'm pretty sure the cat's out of the bag.

 https://github.com/openstack/requirements/blob/master/global-requirements.txt#L29

 https://kazoo.readthedocs.org/en/latest/

 -Josh

 On 10/31/13 7:43 AM, Monty Taylor mord...@inaugust.com wrote:



 On 10/30/2013 10:42 AM, Clint Byrum wrote:
 So, recently we've had quite a long thread in gerrit regarding locking
 in Heat:

 https://review.openstack.org/#/c/49440/

 In the patch, there are two distributed lock drivers. One uses SQL,
 and suffers from all the problems you might imagine a SQL based
 locking
 system would. It is extremely hard to detect dead lock holders, so we
 end up with really long timeouts. The other is ZooKeeper.

 I'm on record as saying we're not using ZooKeeper. It is a little
 embarrassing to have taken such a position without really thinking
 things
 through. The main reason I feel this way though, is not because
 ZooKeeper
 wouldn't work for locking, but because I think locking is a mistake.

 The current multi-engine paradigm has a race condition. If you have a
 stack action going on, the state is held in the engine itself, and not
 in the database, so if another engine starts working on another
 action,
 they will conflict.

 The locking paradigm is meant to prevent this. But I think this is a
 huge mistake.

 The engine should store _all_ of its state in a distributed data store
 of some kind. Any engine should be aware of what is already happening
 with the stack from this state and act accordingly. That includes the
  engine currently working on actions. When viewed through this lens,
 to me, locking is a poor excuse for serializing the state of the
 engine
 scheduler.

 It feels like TaskFlow is the answer, with an eye for making sure
 TaskFlow can be made to work with distributed state. I am not well
 versed on TaskFlow's details though, so I may be wrong. It worries me
 that TaskFlow has existed a while and doesn't seem to be solving real
 problems, but maybe I'm wrong and it is actually in use already.

 Anyway, as a band-aid, we may _have_ to do locking. For that,
 ZooKeeper
 has some real advantages over using the database. But there is
 hesitance
 because it is not widely supported in OpenStack. What say you,
 OpenStack
 community? Should we keep ZooKeeper out of our.. zoo?

 Yes. I'm strongly opposed to ZooKeeper finding its way into the already
 complex pile of things we use.



Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-31 Thread Joshua Harlow
Agreed, 

I don't think we should enforce a hard requirement, for the same reason
I don't think we should rule zookeeper out.

To me they are the same, saying 'no ZK' is the same as saying 'only mysql
or rabbitmq'...

Both not so open.

On 10/31/13 11:04 AM, Monty Taylor mord...@inaugust.com wrote:



On 10/31/2013 01:32 PM, Joshua Harlow wrote:
 In the spirit of openness, yes I do think they are all needed.
 
 If they are not supported, then openstack is not open, it is a closed
 system.
 
 We should strive to innovate, not strive to be stuck with the status quo.
 
 To me it is a developer's decision to pick the right solution; if that
 solution involves some complexity then you make it a pluggable solution.
 Your view of the right solution will likely not be someone else's view of
 the right solution in the end anyway (and you likely can't predict future
 solutions that will be applicable anyway). If you just say no to that plugin
 then you are just excluding people from participating in your project, and
 this is imho against the spirit of openness in general. And those people
 who would have contributed will just start looking elsewhere for a
 solution which does work. This kills the openstack...

Hrm. I certainly don't want to kill openstack, and I _certainly_ don't
want to disallow a plugin. I apologize if it came across that way.

What I was questioning is whether or not we wanted to add a hard
requirement on zookeeper somewhere. Even Rabbit and MySQL are decisions
with other options. It's entirely possible I misread the conversation of
course...

 On 10/31/13 10:17 AM, Monty Taylor mord...@inaugust.com wrote:
 
 Sigh.

 Yay We've added more competing methods of complexity!!!

 Seriously. We now think that rabbit and zookeeper and mysql are ALL
 needed?

 Joshua Harlow harlo...@yahoo-inc.com wrote:

 I'm pretty sure the cat's out of the bag.

 
https://github.com/openstack/requirements/blob/master/global-requirements.txt#L29

 https://kazoo.readthedocs.org/en/latest/

 -Josh

 On 10/31/13 7:43 AM, Monty Taylor mord...@inaugust.com wrote:



 On 10/30/2013 10:42 AM, Clint Byrum wrote:
 So, recently we've had quite a long thread in gerrit regarding
locking
 in Heat:

 https://review.openstack.org/#/c/49440/

 In the patch, there are two distributed lock drivers. One uses SQL,
 and suffers from all the problems you might imagine a SQL based
 locking
 system would. It is extremely hard to detect dead lock holders, so
we
 end up with really long timeouts. The other is ZooKeeper.

 I'm on record as saying we're not using ZooKeeper. It is a little
 embarrassing to have taken such a position without really thinking
 things
 through. The main reason I feel this way though, is not because
 ZooKeeper
 wouldn't work for locking, but because I think locking is a mistake.

 The current multi-engine paradigm has a race condition. If you have
a
 stack action going on, the state is held in the engine itself, and
not
 in the database, so if another engine starts working on another
 action,
 they will conflict.

 The locking paradigm is meant to prevent this. But I think this is a
 huge mistake.

 The engine should store _all_ of its state in a distributed data
store
 of some kind. Any engine should be aware of what is already
happening
 with the stack from this state and act accordingly. That includes
the
  engine currently working on actions. When viewed through this lens,
 to me, locking is a poor excuse for serializing the state of the
 engine
 scheduler.

 It feels like TaskFlow is the answer, with an eye for making sure
 TaskFlow can be made to work with distributed state. I am not well
 versed on TaskFlow's details though, so I may be wrong. It worries
me
 that TaskFlow has existed a while and doesn't seem to be solving
real
 problems, but maybe I'm wrong and it is actually in use already.

 Anyway, as a band-aid, we may _have_ to do locking. For that,
 ZooKeeper
 has some real advantages over using the database. But there is
 hesitance
 because it is not widely supported in OpenStack. What say you,
 OpenStack
 community? Should we keep ZooKeeper out of our.. zoo?

 Yes. I'm strongly opposed to ZooKeeper finding its way into the
already
 complex pile of things we use.



Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-30 Thread Georgy Okrokvertskhov
Hi Clint,

I think you raised a point here. We implemented a distributed engine in Murano
without a locking mechanism by keeping state consistent at each step. We
extracted this engine from Murano and plan to put it in as part of the Mistral
project for task management and execution. A working Mistral implementation
will appear during IceHouse development. We are working closely with the
taskflow team, so I think you can expect to have distributed task execution
support in the taskflow library natively or through Mistral.

I am not against ZooKeeper, but I think that for an OpenStack service it is
better to use an Oslo library shared with other projects instead of adding
a custom locking mechanism for one project.

Thanks
Georgy


On Wed, Oct 30, 2013 at 10:42 AM, Clint Byrum cl...@fewbar.com wrote:

 So, recently we've had quite a long thread in gerrit regarding locking
 in Heat:

 https://review.openstack.org/#/c/49440/

 In the patch, there are two distributed lock drivers. One uses SQL,
 and suffers from all the problems you might imagine a SQL based locking
 system would. It is extremely hard to detect dead lock holders, so we
 end up with really long timeouts. The other is ZooKeeper.

 I'm on record as saying we're not using ZooKeeper. It is a little
 embarrassing to have taken such a position without really thinking things
 through. The main reason I feel this way though, is not because ZooKeeper
 wouldn't work for locking, but because I think locking is a mistake.

 The current multi-engine paradigm has a race condition. If you have a
 stack action going on, the state is held in the engine itself, and not
 in the database, so if another engine starts working on another action,
 they will conflict.

 The locking paradigm is meant to prevent this. But I think this is a
 huge mistake.

 The engine should store _all_ of its state in a distributed data store
 of some kind. Any engine should be aware of what is already happening
 with the stack from this state and act accordingly. That includes the
  engine currently working on actions. When viewed through this lens,
 to me, locking is a poor excuse for serializing the state of the engine
 scheduler.

 It feels like TaskFlow is the answer, with an eye for making sure
 TaskFlow can be made to work with distributed state. I am not well
 versed on TaskFlow's details though, so I may be wrong. It worries me
 that TaskFlow has existed a while and doesn't seem to be solving real
 problems, but maybe I'm wrong and it is actually in use already.

 Anyway, as a band-aid, we may _have_ to do locking. For that, ZooKeeper
 has some real advantages over using the database. But there is hesitance
 because it is not widely supported in OpenStack. What say you, OpenStack
 community? Should we keep ZooKeeper out of our.. zoo?





-- 
Georgy Okrokvertskhov
Technical Program Manager,
Cloud and Infrastructure Services,
Mirantis
http://www.mirantis.com
Tel. +1 650 963 9828
Mob. +1 650 996 3284


Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-30 Thread Alex Glikson
There is a ZK-backed driver in the Nova service heartbeat mechanism (
https://blueprints.launchpad.net/nova/+spec/zk-service-heartbeat) -- it would 
be interesting to know whether it is widely used (might be worth asking on 
the general ML, or in user groups). There have also been discussions on using 
it for other purposes (some listed towards the bottom of 
https://wiki.openstack.org/wiki/NovaZooKeeperHeartbeat). While I am not 
aware of any particular progress with implementing any of them, I think 
they still make sense and could be useful.

Regards,
Alex




From:   Clint Byrum cl...@fewbar.com
To: openstack-dev openstack-dev@lists.openstack.org, 
Date:   30/10/2013 07:45 PM
Subject:[openstack-dev] [Heat] Locking and ZooKeeper - a space 
oddysey



So, recently we've had quite a long thread in gerrit regarding locking
in Heat:

https://review.openstack.org/#/c/49440/

In the patch, there are two distributed lock drivers. One uses SQL,
and suffers from all the problems you might imagine a SQL based locking
system would. It is extremely hard to detect dead lock holders, so we
end up with really long timeouts. The other is ZooKeeper.

I'm on record as saying we're not using ZooKeeper. It is a little
embarrassing to have taken such a position without really thinking things
through. The main reason I feel this way though, is not because ZooKeeper
wouldn't work for locking, but because I think locking is a mistake.

The current multi-engine paradigm has a race condition. If you have a
stack action going on, the state is held in the engine itself, and not
in the database, so if another engine starts working on another action,
they will conflict.

The locking paradigm is meant to prevent this. But I think this is a
huge mistake.

The engine should store _all_ of its state in a distributed data store
of some kind. Any engine should be aware of what is already happening
with the stack from this state and act accordingly. That includes the
engine currently working on actions. When viewed through this lens,
to me, locking is a poor excuse for serializing the state of the engine
scheduler.

It feels like TaskFlow is the answer, with an eye for making sure
TaskFlow can be made to work with distributed state. I am not well
versed on TaskFlow's details though, so I may be wrong. It worries me
that TaskFlow has existed a while and doesn't seem to be solving real
problems, but maybe I'm wrong and it is actually in use already.

Anyway, as a band-aid, we may _have_ to do locking. For that, ZooKeeper
has some real advantages over using the database. But there is hesitance
because it is not widely supported in OpenStack. What say you, OpenStack
community? Should we keep ZooKeeper out of our.. zoo?



Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-30 Thread Steven Dake

On 10/30/2013 10:42 AM, Clint Byrum wrote:

So, recently we've had quite a long thread in gerrit regarding locking
in Heat:

https://review.openstack.org/#/c/49440/

In the patch, there are two distributed lock drivers. One uses SQL,
and suffers from all the problems you might imagine a SQL based locking
system would. It is extremely hard to detect dead lock holders, so we
end up with really long timeouts. The other is ZooKeeper.

I'm on record as saying we're not using ZooKeeper. It is a little
embarrassing to have taken such a position without really thinking things
through. The main reason I feel this way though, is not because ZooKeeper
wouldn't work for locking, but because I think locking is a mistake.

The current multi-engine paradigm has a race condition. If you have a
stack action going on, the state is held in the engine itself, and not
in the database, so if another engine starts working on another action,
they will conflict.

The locking paradigm is meant to prevent this. But I think this is a
huge mistake.

The engine should store _all_ of its state in a distributed data store
of some kind. Any engine should be aware of what is already happening
with the stack from this state and act accordingly. That includes the
engine currently working on actions. When viewed through this lens,
to me, locking is a poor excuse for serializing the state of the engine
scheduler.

It feels like TaskFlow is the answer, with an eye for making sure
TaskFlow can be made to work with distributed state. I am not well
versed on TaskFlow's details though, so I may be wrong. It worries me
that TaskFlow has existed a while and doesn't seem to be solving real
problems, but maybe I'm wrong and it is actually in use already.

Anyway, as a band-aid, we may _have_ to do locking. For that, ZooKeeper
has some real advantages over using the database. But there is hesitance
because it is not widely supported in OpenStack. What say you, OpenStack
community? Should we keep ZooKeeper out of our.. zoo?


I will -2 any patch that adds zookeeper as a dependency to Heat.

The rest of the idea sounds good though.  I spoke with Joshua about 
TaskFlow Friday as a possibility for solving this problem, but TaskFlow 
presently does not implement a distributed task flow. Joshua indicated 
there was a celery review at https://review.openstack.org/#/c/47609/, 
but this would introduce a different server dependency which suffers 
from the same issues as Zookeeper, not to mention incomplete AMQP server 
support for various AMQP implementations.  Joshua indicated using a pure 
AMQP implementation would be possible for this job but is not implemented.


I did get into a discussion with him about the subject of breaking the 
tasks in the flow into jobs, which led to the suggestion that the 
parser should be part of the API server process (then the engine could 
be responsible for handling the various jobs Heat needs). Sounds like 
poor abstraction, not to mention serious rework required.


My take from our IRC discussion was that TaskFlow is not a job 
distribution system (like Gearman) but an in-process workflow manager.  
These two things are different.  I was unclear if Taskflow could be made 
to do both, while also operating under already supported AMQP server 
infrastructure that all of OpenStack relies on currently.  If it could, 
that would be fantastic, as we would only have to introduce a library 
dependency vs a full-on server dependency with documentation, HA, and 
scalability concerns.


Regards
-steve




Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-30 Thread Mike Spreitzer
Clint Byrum cl...@fewbar.com wrote on 10/30/2013 01:42:53 PM:
 ...
 
 The engine should store _all_ of its state in a distributed data store
 of some kind. Any engine should be aware of what is already happening
 with the stack from this state and act accordingly. That includes the
 engine currently working on actions. When viewed through this lense,
 to me, locking is a poor excuse for serializing the state of the engine
 scheduler.

I agree.  I reached a similar conclusion this spring when thinking through 
the multi-engine issue for my group's work.

 It feels like TaskFlow is the answer, with an eye for making sure
 TaskFlow can be made to work with distributed state. I am not well
 versed on TaskFlow's details though, so I may be wrong. It worries me
 that TaskFlow has existed a while and doesn't seem to be solving real
 problems, but maybe I'm wrong and it is actually in use already.

As Zane pointed out when I asked, there is a difference between 
orchestration and workflow: orchestration is about bringing two things 
into alignment, and can adapt to changes in one or both along the way; I 
think of a workflow as an ossified orchestration: it is not so inherently 
adaptable (unless you write a workflow that is really an orchestrator). 
Adaptability gets even more important when you add more agents driving 
change into the system.
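
A toy loop illustrating that distinction (all names are hypothetical): an
orchestrator repeatedly compares desired state with observed state and acts
on the difference, so it adapts if either side changes mid-flight, where a
fixed workflow would simply run its script to completion.

    import time

    def converge(get_desired, get_actual, apply_diff, poll=5):
        while True:
            desired, actual = get_desired(), get_actual()
            if desired == actual:
                break                        # aligned, for now
            apply_diff(desired, actual)      # act on the diff, not a script
            time.sleep(poll)                 # re-observe; inputs may change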

Regards,
Mike


Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-30 Thread Joshua Harlow
For taskflow usage/details: https://wiki.openstack.org/wiki/TaskFlow

Let me know if the documentation there is not sufficient about defining
who it's useful for, so that I can make it more clear (to resolve the 'versed
on TaskFlow's details' part). I have tried to define the use-cases that it
can help solve; in fact the engine design you just described as a desired
thing for heat it is already doing, saving _all_ of its state in a distributed
store - a database right now. So there are more similarities between what you
_want_ and what taskflow actually has in its 0.1 release
(https://pypi.python.org/pypi/taskflow).

More documentation related to this:

- https://wiki.openstack.org/wiki/TaskFlow/Persistence (*)
- https://wiki.openstack.org/wiki/TaskFlow/Engines
- https://wiki.openstack.org/wiki/TaskFlow/Inputs_and_Outputs (*)
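
A sketch of the persistence piece described above (the connection URL and
task are made up): handing engines.run a backend makes task results and
flow state land in the store instead of living only in engine memory.

    from taskflow import engines, task
    from taskflow.patterns import linear_flow
    from taskflow.persistence import backends

    class Noop(task.Task):
        def execute(self):
            return "done"

    # Fetch a persistence backend and create its tables.
    backend = backends.fetch({"connection": "sqlite:///flows.db"})
    backend.get_connection().upgrade()

    flow = linear_flow.Flow("demo").add(Noop())
    # Flow and task states (and returned results) are recorded in the
    # backend as the engine runs.
    engines.run(flow, backend=backend)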

So let me know if the docs there do not describe things in the detail you
want/desire and I can help there.

As for actual usage, since taskflow is ~5.2 months old (about the same age
as the heat engine code) your concern about usage is valid, and I am
working at the summit to spread awareness and gain more usage (ongoing
work is happening in nova as we speak; cinder has a version in havana that
is being used for its complete create_volume workflow - one of the key
workflows there). So that's tremendous progress imho, to have a library
like taskflow have a 'stable' 0.1 version as well as get usage in havana
(while the library itself was being created).

That's amazing to me, and I and others are pretty proud of that :-)

Feel free to join in some of the HK sessions that I will have.

Design sessions:

- http://icehousedesignsummit.sched.org/event/1ec7d73aba03ad0b95bd8de631c623cb#.Um83SxBWlgc
- http://icehousedesignsummit.sched.org/event/ced7d22ac4c037f102b3cf3ade553104#.Um83YxBWlgc
- http://icehousedesignsummit.sched.org/event/c31c81de71c25333b876b0da2f430f50#.Um83bRBWlgc
- http://icehousedesignsummit.sched.org/event/5fc501fadd4faed52556ed700c39e5f2#.Um83dhBWlgc

Speaker sessions:

- http://openstacksummitnovember2013.sched.org/event/29f1f996b36aaf0febc5d43b6f53f2a4#.Um83phBWlgc


On 10/30/13 10:42 AM, Clint Byrum cl...@fewbar.com wrote:

So, recently we've had quite a long thread in gerrit regarding locking
in Heat:

https://review.openstack.org/#/c/49440/

In the patch, there are two distributed lock drivers. One uses SQL,
and suffers from all the problems you might imagine a SQL based locking
system would. It is extremely hard to detect dead lock holders, so we
end up with really long timeouts. The other is ZooKeeper.

I'm on record as saying we're not using ZooKeeper. It is a little
embarrassing to have taken such a position without really thinking things
through. The main reason I feel this way though, is not because ZooKeeper
wouldn't work for locking, but because I think locking is a mistake.

The current multi-engine paradigm has a race condition. If you have a
stack action going on, the state is held in the engine itself, and not
in the database, so if another engine starts working on another action,
they will conflict.

The locking paradigm is meant to prevent this. But I think this is a
huge mistake.

The engine should store _all_ of its state in a distributed data store
of some kind. Any engine should be aware of what is already happening
with the stack from this state and act accordingly. That includes the
engine currently working on actions. When viewed through this lens,
to me, locking is a poor excuse for serializing the state of the engine
scheduler.

It feels like TaskFlow is the answer, with an eye for making sure
TaskFlow can be made to work with distributed state. I am not well
versed on TaskFlow's details though, so I may be wrong. It worries me
that TaskFlow has existed a while and doesn't seem to be solving real
problems, but maybe I'm wrong and it is actually in use already.

Anyway, as a band-aid, we may _have_ to do locking. For that, ZooKeeper
has some real advantages over using the database. But there is hesitance
because it is not widely supported in OpenStack. What say you, OpenStack
community? Should we keep ZooKeeper out of our.. zoo?

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-30 Thread Robert Collins
On 31 October 2013 06:42, Clint Byrum cl...@fewbar.com wrote:
 So, recently we've had quite a long thread in gerrit regarding locking
 in Heat:

 https://review.openstack.org/#/c/49440/

 In the patch, there are two distributed lock drivers. One uses SQL,
 and suffers from all the problems you might imagine a SQL based locking
 system would. It is extremely hard to detect dead lock holders, so we
 end up with really long timeouts. The other is ZooKeeper.

 I'm on record as saying we're not using ZooKeeper. It is a little
 embarrassing to have taken such a position without really thinking things
 through. The main reason I feel this way though, is not because ZooKeeper
 wouldn't work for locking, but because I think locking is a mistake.

I agree with all your points:
 - that mutex style locking here is a mistake
 - that we need a workaround in the short term
 - that sql locking can be hard to get right

However, if this is a short-term workaround, who cares if SQL locking
has bad failure modes: it's short term, and the failure we're replacing
(engines trampling on each other) is also bad.

On Zookeeper: this would be the first Java service /required/ as part
of a deployment of OpenStack's integrated components. I think that
requires broad consensus - possibly even a TC vote - before adding it.
[NB: I'm not against Java, but it's not a social norm here]. Secondly,
but also importantly, I seem to recall Zookeeper really not being
suitable for secure environments, but maybe that's just how it was used
in my previous interactions with it?

-Rob

-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-30 Thread Joshua Harlow
So afaik this is what hbase recommends:

http://hbase.apache.org/book/zk.sasl.auth.html

It doesn't integrate with keystone, but that's expected ;)

-Josh

On 10/30/13 11:25 AM, Robert Collins robe...@robertcollins.net wrote:

On 31 October 2013 06:42, Clint Byrum cl...@fewbar.com wrote:
 So, recently we've had quite a long thread in gerrit regarding locking
 in Heat:

 https://review.openstack.org/#/c/49440/

 In the patch, there are two distributed lock drivers. One uses SQL,
 and suffers from all the problems you might imagine a SQL based locking
 system would. It is extremely hard to detect dead lock holders, so we
 end up with really long timeouts. The other is ZooKeeper.

 I'm on record as saying we're not using ZooKeeper. It is a little
 embarrassing to have taken such a position without really thinking things
 through. The main reason I feel this way though, is not because ZooKeeper
 wouldn't work for locking, but because I think locking is a mistake.

I agree with all your points:
 - that mutex style locking here is a mistake
 - that we need a workaround in the short term
 - that sql locking can be hard to get right

However, if this is a short-term workaround, who cares if SQL locking
has bad failure modes: it's short term, and the failure we're replacing
(engines trampling on each other) is also bad.

On Zookeeper: this would be the first Java service /required/ as part
of a deployment of OpenStack's integrated components. I think that
requires broad consensus - possibly even a TC vote - before adding it.
[NB: I'm not against Java, but it's not a social norm here]. Secondly,
but also importantly, I seem to recall Zookeeper really not being
suitable for secure environments, but maybe that's just how it was used
in my previous interactions with it?

-Rob

-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-30 Thread Joshua Harlow
As for the mutex and locking and all that problem:

I would expect locking to be a necessity at some point for openstack.

Even if the state transitions are the locks themselves (that's still a
lock by another name imho), you need a reliable way to store and change
those state transitions (aka what the last one was, what the next one is).
A database can likely be ok here but is not ideal as complexity increases.
The other part that I think zookeeper addresses is similar to how it is
used in nova, where it's used as a 'liveness' system: instead of
constantly updating a database with heartbeats, zookeeper itself
maintains that information without requiring constant updates to a DB
(which doesn't scale).

There are fairly good reasons why zookeeper and chubby (and similar
systems) exist :)

- http://research.google.com/archive/chubby.html
- http://labs.yahoo.com/publication/zookeeper-wait-free-coordination-for-internet-scale-systems/
- http://devo.ps/blog/2013/09/11/zookeeper-vs-doozer-vs-etcd.html


Of course the usage of such systems must be carefully discussed and
thought out, but that's nothing new to everyone here.
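
As a concrete sketch of the liveness + locking combination using kazoo (a
common python zookeeper client) - the hosts, paths, and identifiers below
are made-up examples:

from kazoo.client import KazooClient

zk = KazooClient(hosts='zk1:2181,zk2:2181,zk3:2181')  # hypothetical quorum
zk.start()

# Liveness: an ephemeral znode disappears automatically when this
# process's session dies - no heartbeat rows to keep rewriting in a DB.
zk.create('/heat/engines/engine-1', b'alive', ephemeral=True, makepath=True)

# Locking: a distributed lock recipe on top of the same quorum.
lock = zk.Lock('/heat/locks/stack-1234', identifier='engine-1')
with lock:  # blocks until acquired; released if the session is lost
    pass  # do the guarded work here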

On 10/30/13 11:25 AM, Robert Collins robe...@robertcollins.net wrote:

On 31 October 2013 06:42, Clint Byrum cl...@fewbar.com wrote:
 So, recently we've had quite a long thread in gerrit regarding locking
 in Heat:

 https://review.openstack.org/#/c/49440/

 In the patch, there are two distributed lock drivers. One uses SQL,
 and suffers from all the problems you might imagine a SQL based locking
 system would. It is extremely hard to detect dead lock holders, so we
 end up with really long timeouts. The other is ZooKeeper.

 I'm on record as saying we're not using ZooKeeper. It is a little
 embarrassing to have taken such a position without really thinking things
 through. The main reason I feel this way though, is not because ZooKeeper
 wouldn't work for locking, but because I think locking is a mistake.

I agree with all your points:
 - that mutex style locking here is a mistake
 - that we need a workaround in the short term
 - that sql locking can be hard to get right

However, if this is a short-term workaround, who cares if SQL locking
has bad failure modes: it's short term, and the failure we're replacing
(engines trampling on each other) is also bad.

On Zookeeper: this would be the first Java service /required/ as part
of a deployment of OpenStack's integrated components. I think that
requires broad consensus - possibly even a TC vote - before adding it.
[NB: I'm not against Java, but it's not a social norm here]. Secondly,
but also importantly, I seem to recall Zookeeper really not being
suitable for secure environments, but maybe that's just how it was used
in my previous interactions with it?

-Rob

-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-30 Thread Tim Daly

On Oct 30, 2013, at 11:11 AM, Alex Glikson glik...@il.ibm.com wrote:

 There is a ZK-backed driver in Nova service heartbeat mechanism 
 (https://blueprints.launchpad.net/nova/+spec/zk-service-heartbeat) -- would 
 be interesting to know whether it is widely used (might be worth asking at 
 the general ML, or user groups). There have been also discussions on using it 
 for other purposes (some listed towards the bottom at 
 https://wiki.openstack.org/wiki/NovaZooKeeperHeartbeat). While I am not aware 
 of any particular progress with implementing any of them, I think they still 
 make sense and could be useful. 
 
 Regards, 
 Alex 

Zookeeper is very useful, and we already deploy it on our clusters here at Y!.  
But we only use it for leader election of nova scheduler and network processes, 
not for service groups yet.  I've been meaning to try it for service groups, 
though, because I see the service group database queries eating a lot of time 
in the scheduler process on large (hundreds of nodes) clusters.  Presumably the 
ZK driver would be cheaper, and would hopefully also not leave such a long 
window before it notices a node is down.  Just haven't gotten around to it yet.

I have yet to see any other mature open source distributed coordination 
service.  Yahoo has run zookeeper in production for ages now, since long before 
we started using openstack.

For yahoo, python is the new language that we're not sure is good for 
production: each process is effectively constrained to a single CPU, there 
is no good static checker, and there are so many random little libraries.  
Obviously we're using it, but it's new to us.  But Java?  It must be the 
number one or maybe number two language for production code the world over 
right now.


Cheers,
Tim



___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-30 Thread Sandy Walsh


On 10/30/2013 03:10 PM, Steven Dake wrote:
 I will -2 any patch that adds zookeeper as a dependency to Heat.

Certainly any distributed locking solution should be plugin based and
optional. Just as a database-oriented solution could be the default plugin.

Re: the Java issue, we already have optional components in other
languages. I know Java is a different league of pain, but if it's an
optional component and left as a choice of the deployer, should we care?

-S

PS As an aside, what are your issues with ZK?


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-30 Thread Jay Pipes

Has anyone looked into using concoord for distributed locking?

https://pypi.python.org/pypi/concoord

Best,
-jay

On 10/30/2013 02:39 PM, Joshua Harlow wrote:

So my idea here was to break the abstraction for heat into 3 parts.

Pardon my lack of heat terminology/knowledge if I miss something.

1. The thing that receives the API request (I would assume an api server
here).

I would expect #1 to parse something into a known internal format. Whether
this is tasks or jobs or something is up to heat, so this might have been
my lack of understanding of heat concepts here, but usually an API request
translates into some internal format. Maybe this is the parser or
something else (not sure really).

Let's assume for now that it parses the API request into some tasks + flow
(what taskflow provides).

So then it becomes a question of what you do with those tasks & flows
(what I call stage #2).

- https://wiki.openstack.org/wiki/TaskFlow#Two_thousand_foot_view

To me this is where taskflow 'shines' in that it has an engine concept
which can run in various manners (the tasks and flows are not strongly
associated with an engine). One of these engines is planned to be a
distributed one (but it's not the only one), and with that engine type it
would have to interact with some type of job management system (or it
would have to provide that job management system - or a simple version
itself), but the difference is that the tasks and flows (and the
links/structure between them) are still disconnected from the actual
engine that runs those tasks & flows. So this to me means that there is
pluggability with regard to execution, which I think is pretty great.

If that requires rework of the heat model and way of running, maybe it's
for the better? Idk.

As taskflow is still newish, and most projects in openstack have their own
distributed model (conductors, rpc process separation), we wanted to focus
on having the basic principles down first; as for the review
https://review.openstack.org/#/c/47609/, I am very grateful to Jessica for
working her hardest to get that into a nearly-there state. So yes, taskflow
will continue on the path/spirit of 47609, and contributions are welcome
of course :-)

Feel free to also jump on #openstack-state-management since it might be
easier to just chat there in the end with other interested parties.

-Josh

On 10/30/13 11:10 AM, Steven Dake sd...@redhat.com wrote:


On 10/30/2013 10:42 AM, Clint Byrum wrote:

So, recently we've had quite a long thread in gerrit regarding locking
in Heat:

https://review.openstack.org/#/c/49440/

In the patch, there are two distributed lock drivers. One uses SQL,
and suffers from all the problems you might imagine a SQL based locking
system would. It is extremely hard to detect dead lock holders, so we
end up with really long timeouts. The other is ZooKeeper.

I'm on record as saying we're not using ZooKeeper. It is a little
embarrassing to have taken such a position without really thinking things
through. The main reason I feel this way though, is not because ZooKeeper
wouldn't work for locking, but because I think locking is a mistake.

The current multi-engine paradigm has a race condition. If you have a
stack action going on, the state is held in the engine itself, and not
in the database, so if another engine starts working on another action,
they will conflict.

The locking paradigm is meant to prevent this. But I think this is a
huge mistake.

The engine should store _all_ of its state in a distributed data store
of some kind. Any engine should be aware of what is already happening
with the stack from this state and act accordingly. That includes the
engine currently working on actions. When viewed through this lens,
to me, locking is a poor excuse for serializing the state of the engine
scheduler.

It feels like TaskFlow is the answer, with an eye for making sure
TaskFlow can be made to work with distributed state. I am not well
versed on TaskFlow's details though, so I may be wrong. It worries me
that TaskFlow has existed a while and doesn't seem to be solving real
problems, but maybe I'm wrong and it is actually in use already.

Anyway, as a band-aid, we may _have_ to do locking. For that, ZooKeeper
has some real advantages over using the database. But there is hesitance
because it is not widely supported in OpenStack. What say you, OpenStack
community? Should we keep ZooKeeper out of our.. zoo?


I will -2 any patch that adds zookeeper as a dependency to Heat.

The rest of the idea sounds good though.  I spoke with Joshua about
TaskFlow Friday as a possibility for solving this problem, but TaskFlow
presently does not implement a distributed task flow. Joshua indicated
there was a Celery review at https://review.openstack.org/#/c/47609/,
but this would introduce a different server dependency which suffers
from the same issues as Zookeeper, not to mention incomplete AMQP server
support for various AMQP implementations.  Joshua indicated using a pure
AMQP 

Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-30 Thread Qing He
Has anyone looked at any  lock-free solution?

-Original Message-
From: Sandy Walsh [mailto:sandy.wa...@rackspace.com] 
Sent: Wednesday, October 30, 2013 12:20 PM
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey



On 10/30/2013 03:10 PM, Steven Dake wrote:
 I will -2 any patch that adds zookeeper as a dependency to Heat.

Certainly any distributed locking solution should be plugin based and optional. 
Just as a database-oriented solution could be the default plugin.

Re: the Java issue, we already have optional components in other languages. I 
know Java is a different league of pain, but if it's an optional component and 
left as a choice of the deployer, should we care?

-S

PS As an aside, what are your issues with ZK?


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-30 Thread Clint Byrum
Excerpts from Joshua Harlow's message of 2013-10-30 11:57:30 -0700:
 As for the mutex and locking and all that problem:
 
 I would expect locking to be a necessity at some point for openstack.
 
 Even if the state transitions are the locks themselves (that's still a
 lock by another name imho), you need a reliable way to store and change
 those state transitions (aka what the last one was, what the next one is).
 A database can likely be ok here but is not ideal as complexity increases.
 The other part that I think zookeeper addresses is similar to how it is
 used in nova, where it's used as a 'liveness' system: instead of
 constantly updating a database with heartbeats, zookeeper itself
 maintains that information without requiring constant updates to a DB
 (which doesn't scale).

I'm not so concerned with locks under the covers, I am concerned with
locking as a paradigm. It is too low level for what Heat is aiming at.
Yes of course serialization can and often does rely on locking. My point
is really that we should not care how serialization happens, we should
just express the work-flow, and let the underlying mechanisms distribute
and manage it as it is completed. My sincerest hope is that we can make
TaskFlow do this.
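
As a rough sketch of what 'just express the work-flow' can look like in
taskflow (the task bodies and names below are stubbed placeholders, not
Heat's actual model):

import taskflow.engines
from taskflow import task
from taskflow.patterns import linear_flow


class CreateServer(task.Task):
    def execute(self):
        # ... call nova here (stubbed for the sketch) ...
        return 'server-123'


class AttachVolume(task.Task):
    def execute(self, server_id):
        # 'server_id' is wired in from CreateServer's result by the engine.
        print('attaching volume to %s' % server_id)


flow = linear_flow.Flow('create-stack').add(
    CreateServer(provides='server_id'),
    AttachVolume(),
)
taskflow.engines.load(flow).run()

The flow only declares what to do and the data dependencies between steps;
which engine executes it (an in-process one today, a distributed one with a
shared persistence backend later) is a separate, swappable choice - no
stack lock in sight.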

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-30 Thread Sandy Walsh
Doh, sorry, left out the important part I had originally intended.

The ZK unit tests could be split to not run by default, but if you're a
ZK shop ... run them yourself. They might not be included in the gerrit
tests, but that should be the nature of heavy-weight drivers.

We need to do more of this test splitting in general anyway.
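
For example, the split can be as simple as a conditional skip (the
environment variable and test class names here are made up):

import os
import unittest


@unittest.skipUnless(os.environ.get('ZOOKEEPER_TEST_HOSTS'),
                     'ZooKeeper not configured; skipping ZK driver tests')
class TestZKLockDriver(unittest.TestCase):

    def test_acquire_release(self):
        # would talk to the cluster named in ZOOKEEPER_TEST_HOSTS
        pass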

-S


On 10/30/2013 04:20 PM, Sandy Walsh wrote:
 
 
 On 10/30/2013 03:10 PM, Steven Dake wrote:
 I will -2 any patch that adds zookeeper as a dependency to Heat.
 
 Certainly any distributed locking solution should be plugin based and
 optional. Just as a database-oriented solution could be the default plugin.
 
 Re: the Java issue, we already have optional components in other
 languages. I know Java is a different league of pain, but if it's an
 optional component and left as a choice of the deployer, should we care?
 
 -S
 
 PS As an aside, what are your issues with ZK?
 
 
 ___
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-30 Thread Robert Collins
On 31 October 2013 08:37, Sandy Walsh sandy.wa...@rackspace.com wrote:
 Doh, sorry, left out the important part I had originally intended.

 The ZK unit tests could be split to not run by default, but if you're a
 ZK shop ... run them yourself. They might not be included in the gerrit
 tests, but that should be the nature of heavy-weight drivers.

 We need to do more of this test splitting in general anyway.

Yes... but.

We need to aim at production. If ZK is going to be the production sane
way of doing it with the reference OpenStack code base, then we
absolutely have to have our functional and integration tests run with
ZK. Unit tests shouldn't be talking to a live ZK anyhow, so they don't
concern me.

-Rob

-- 
Robert Collins rbtcoll...@hp.com
Distinguished Technologist
HP Converged Cloud

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-30 Thread Joshua Harlow
+1 ;)

On 10/30/13 12:37 PM, Clint Byrum cl...@fewbar.com wrote:

Excerpts from Joshua Harlow's message of 2013-10-30 11:57:30 -0700:
 As for the mutex and locking and all that problem:
 
 I would expect locking to be a necessity at some point for openstack.
 
 Even if the state transitions are the locks themselves (that's still a
 lock by another name imho), you need a reliable way to store and change
 those state transitions (aka what the last one was, what the next one is).
 A database can likely be ok here but is not ideal as complexity increases.
 The other part that I think zookeeper addresses is similar to how it is
 used in nova, where it's used as a 'liveness' system: instead of
 constantly updating a database with heartbeats, zookeeper itself
 maintains that information without requiring constant updates to a DB
 (which doesn't scale).

I'm not so concerned with locks under the covers, I am concerned with
locking as a paradigm. It is too low level for what Heat is aiming at.
Yes of course serialization can and often does rely on locking. My point
is really that we should not care how serialization happens, we should
just express the work-flow, and let the underlying mechanisms distribute
and manage it as it is completed. My sincerest hope is that we can make
TaskFlow do this.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-30 Thread Chris Friesen

On 10/30/2013 01:34 PM, Joshua Harlow wrote:

To me you just made state consistency be a lock by another name.

A lock protects a region of code from being mutually accessed


Personally I view a lock as protecting a set of data from being mutually 
accessed.



The question to me becomes what
happens to that state consistency when it's running in a distributed
system, which all of openstack is running in. At that point you need a way
to ensure multiple servers (going through various states) are not
manipulating the same resources at the same time (delete volume from
cinder, while attaching it in nova). Those 2 separate services do not
likely share the same state transitions (and will likely not, as they
would become tightly coupled at that point). So then you need some type of
coordination system to ensure the ordering of these 2 resource actions
is done in a consistent manner.


This sort of thing seems solvable by a reserve-before-use kind of 
model, without needing any mutex locking as such.


When attaching, do an atomic 
check-if-owner-is-empty-and-store-instance-as-owner transaction to 
store the instance as the owner of the volume very early. Then reload 
from the database to make sure the instance is the current owner, and 
now you're guaranteed that nobody can delete it under your feet.


When deleting, if the current owner is set and the owner instance exists 
then bail out with an error.


This is essentially akin to using atomic-test-and-set instead of a mutex.
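
A minimal sketch of that test-and-set in SQL terms (hypothetical volumes
table and DB-API-style calls, just for illustration):

def reserve_volume(conn, volume_id, instance_id):
    # Atomic test-and-set: claim the volume only if nobody owns it yet.
    cur = conn.execute(
        "UPDATE volumes SET owner = ? WHERE id = ? AND owner IS NULL",
        (instance_id, volume_id))
    if cur.rowcount != 1:
        raise RuntimeError('volume %s is already reserved' % volume_id)
    # Reload to confirm we really are the current owner before proceeding.
    row = conn.execute(
        "SELECT owner FROM volumes WHERE id = ?", (volume_id,)).fetchone()
    assert row[0] == instance_id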

Chris

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-30 Thread Sandy Walsh


On 10/30/2013 04:44 PM, Robert Collins wrote:
 On 31 October 2013 08:37, Sandy Walsh sandy.wa...@rackspace.com wrote:
 Doh, sorry, left out the important part I had originally intended.

 The ZK unit tests could be split to not run by default, but if you're a
 ZK shop ... run them yourself. They might not be included in the gerrit
 tests, but that should be the nature of heavy-weight drivers.

 We need to do more of this test splitting in general anyway.
 
 Yes... but.
 
 We need to aim at production. If ZK is going to be the production sane
 way of doing it with the reference OpenStack code base, then we
 absolutely have to have our functional and integration tests run with
 ZK. Unit tests shouldn't be talking to a live ZK anyhow, so they don't
 concern me.

Totally agree at the functional/integration test level. My concern was
having to bring ZK into a dev env.

We've already set the precedent with Erlang (rabbitmq). There are HBase
(Java) drivers out there and Torpedo tests against a variety of other
databases.

I think the horse has left the barn.


 
 -Rob
 

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-30 Thread Angus Salkeld

On 30/10/13 15:27 -0400, Jay Pipes wrote:

Has anyone looked into using concoord for distributed locking?

https://pypi.python.org/pypi/concoord


Looks interesting!

-Angus



Best,
-jay

On 10/30/2013 02:39 PM, Joshua Harlow wrote:

So my idea here was to break the abstraction for heat into 3 parts.

Pardon my lack of heat terminology/knowledge if I miss something.

1. The thing that receives the API request (I would assume an api server
here).

I would expect #1 to parse something into a known internal format. Whether
this is tasks or jobs or something is up to heat, so this might have been
my lack of understanding of heat concepts here, but usually an API request
translates into some internal format. Maybe this is the parser or
something else (not sure really).

Let's assume for now that it parses the API request into some tasks + flow
(what taskflow provides).

So then it becomes a question of what you do with those tasks & flows
(what I call stage #2).

- https://wiki.openstack.org/wiki/TaskFlow#Two_thousand_foot_view

To me this is where taskflow 'shines' in that it has an engine concept
which can run in various manners (the tasks and flows are not strongly
associated with an engine). One of these engines is planned to be a
distributed one (but it's not the only one), and with that engine type it
would have to interact with some type of job management system (or it
would have to provide that job management system - or a simple version
itself), but the difference is that the tasks and flows (and the
links/structure between them) are still disconnected from the actual
engine that runs those tasks & flows. So this to me means that there is
pluggability with regard to execution, which I think is pretty great.

If that requires rework of the heat model and way of running, maybe it's
for the better? Idk.

As taskflow is still newish, and most projects in openstack have their own
distributed model (conductors, rpc process separation), we wanted to focus
on having the basic principles down first; as for the review
https://review.openstack.org/#/c/47609/, I am very grateful to Jessica for
working her hardest to get that into a nearly-there state. So yes, taskflow
will continue on the path/spirit of 47609, and contributions are welcome
of course :-)

Feel free to also jump on #openstack-state-management since it might be
easier to just chat there in the end with other interested parties.

-Josh

On 10/30/13 11:10 AM, Steven Dake sd...@redhat.com wrote:


On 10/30/2013 10:42 AM, Clint Byrum wrote:

So, recently we've had quite a long thread in gerrit regarding locking
in Heat:

https://review.openstack.org/#/c/49440/

In the patch, there are two distributed lock drivers. One uses SQL,
and suffers from all the problems you might imagine a SQL based locking
system would. It is extremely hard to detect dead lock holders, so we
end up with really long timeouts. The other is ZooKeeper.

I'm on record as saying we're not using ZooKeeper. It is a little
embarrassing to have taken such a position without really thinking things
through. The main reason I feel this way though, is not because ZooKeeper
wouldn't work for locking, but because I think locking is a mistake.

The current multi-engine paradigm has a race condition. If you have a
stack action going on, the state is held in the engine itself, and not
in the database, so if another engine starts working on another action,
they will conflict.

The locking paradigm is meant to prevent this. But I think this is a
huge mistake.

The engine should store _all_ of its state in a distributed data store
of some kind. Any engine should be aware of what is already happening
with the stack from this state and act accordingly. That includes the
engine currently working on actions. When viewed through this lens,
to me, locking is a poor excuse for serializing the state of the engine
scheduler.

It feels like TaskFlow is the answer, with an eye for making sure
TaskFlow can be made to work with distributed state. I am not well
versed on TaskFlow's details though, so I may be wrong. It worries me
that TaskFlow has existed a while and doesn't seem to be solving real
problems, but maybe I'm wrong and it is actually in use already.

Anyway, as a band-aid, we may _have_ to do locking. For that, ZooKeeper
has some real advantages over using the database. But there is hesitance
because it is not widely supported in OpenStack. What say you, OpenStack
community? Should we keep ZooKeeper out of our.. zoo?


I will -2 any patch that adds zookeeper as a dependency to Heat.

The rest of the idea sounds good though.  I spoke with Joshua about
TaskFlow Friday as a possibility for solving this problem, but TaskFlow
presently does not implement a distributed task flow. Joshua indicated
there was a Celery review at https://review.openstack.org/#/c/47609/,
but this would introduce a different server dependency which suffers
from the same issues as Zookeeper, not to mention incomplete AMQP server
support 

Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-30 Thread Joshua Harlow
This works as long as you have 1 DB and don't fail over to a secondary
slave DB.

Now you can say we all must use percona (or similar) for this, but then
that's a change in deployment as well (and imho a bigger one). This is
where the concept of a quorum in zookeeper comes into play: the
transaction log that zookeeper maintains will be consistent among all
members in that quorum. This is a typical zookeeper deployment strategy
(selecting how many nodes you want in that quorum being an important
question).

It also doesn't handle the case where you can automatically recover from
the current resource owner (nova-compute for example) dying.

Your atomic check-if-owner-is-empty-and-store-instance-as-owner is now
user initiated instead of being automatic (zookeeper provides these kinds
of notifications via its watch concept). So that makes it hard for, say,
an automated system (heat?) to react to these failures in any other way
than repeated polling (or repeated retries or periodic tasks), which means
that heat will not be able to react to failure in a 'live' manner. So this
to me is the liveness question that zookeeper is designed to help out
with; of course you can simulate this in a DB with repeated polling (as
long as you don't try to do anything complicated with mysql, like
replicas/slaves with transaction logs that may not be caught up and that
you might have to fail over to if problems happen, since you are on your
own if this happens).
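
For concreteness, a watch in kazoo looks roughly like this (the path and
handler below are made-up examples):

from kazoo.client import KazooClient

zk = KazooClient(hosts='zk1:2181')  # hypothetical host
zk.start()


# Called immediately, then again whenever the znode changes or goes away
# (e.g. the ephemeral node of a dead nova-compute vanishing), so the
# reaction is push-based instead of a DB polling loop.
@zk.DataWatch('/compute/liveness/host-42')
def on_change(data, stat):
    if stat is None:
        print('owner is gone - trigger recovery')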

-Josh

On 10/30/13 1:02 PM, Chris Friesen chris.frie...@windriver.com wrote:

On 10/30/2013 01:34 PM, Joshua Harlow wrote:
 To me you just made state consistency be a lock by another name.

 A lock protects a region of code from being mutually accessed

Personally I view a lock as protecting a set of data from being mutually
accessed.

 The question to me becomes what
 happens to that state consistency when it's running in a distributed
 system, which all of openstack is running in. At that point you need a way
 to ensure multiple servers (going through various states) are not
 manipulating the same resources at the same time (delete volume from
 cinder, while attaching it in nova). Those 2 separate services do not
 likely share the same state transitions (and will likely not, as they
 would become tightly coupled at that point). So then you need some type of
 coordination system to ensure the ordering of these 2 resource actions
 is done in a consistent manner.

This sort of thing seems solvable by a reserve-before-use kind of
model, without needing any mutex locking as such.

When attaching, do an atomic
check-if-owner-is-empty-and-store-instance-as-owner transaction to
store the instance as the owner of the volume very early. Then reload
from the database to make sure the instance is the current owner, and
now you're guaranteed that nobody can delete it under your feet.

When deleting, if the current owner is set and the owner instance exists
then bail out with an error.

This is essentially akin to using atomic-test-and-set instead of a mutex.

Chris

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-30 Thread Clint Byrum
Excerpts from Joshua Harlow's message of 2013-10-30 17:46:44 -0700:
 This works as long as you have 1 DB and don't fail over to a secondary
 slave DB.
 
 Now you can say we all must use percona (or similar) for this, but then

Did you mean Galera which provides multiple synchronous masters?

 that's a change in deployment as well (and imho a bigger one). This is
 where the concept of a quorum in zookeeper comes into play: the
 transaction log that zookeeper maintains will be consistent among all
 members in that quorum. This is a typical zookeeper deployment strategy
 (selecting how many nodes you want in that quorum being an important
 question).


Galera uses more or less the exact same mechanism.

 It also doesn't handle the case where you can automatically recover from
 the current resource owner (nova-compute for example) dying.


I don't know what that means.

 Your atomic check-if-owner-is-empty-and-store-instance-as-owner is now
 user initiated instead of being automatic (zookeeper provides these kinds
 of notifications via its watch concept). So that makes it hard for, say,
 an automated system (heat?) to react to these failures in any other way
 than repeated polling (or repeated retries or periodic tasks), which means
 that heat will not be able to react to failure in a 'live' manner. So this
 to me is the liveness question that zookeeper is designed to help out
 with; of course you can simulate this in a DB with repeated polling (as
 long as you don't try to do anything complicated with mysql, like
 replicas/slaves with transaction logs that may not be caught up and that
 you might have to fail over to if problems happen, since you are on your
 own if this happens).


Right, even if you have a Galera cluster you still have to poll it or
use wonky things like triggers hooked up to memcache/gearman/amqp UDF's
to get around polling latency.

I think your point is that a weird MySQL is just as disruptive to the
normal OpenStack deployment as a weird service like ZooKeeper.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Heat] Locking and ZooKeeper - a space oddysey

2013-10-30 Thread Joshua Harlow
Yup, galera, thx! :)

As for the:

It also doesn't handle the case where you can automatically recover from
the current resource owner (nova-compute for example) dying.

So heat is actively working on some resources, doing its thing, and then
its binary crashes (or kill -9 occurs) - what happens? The same question
you can ask for nova-compute. Hope that makes more sense now. To me you
need a system that can detect the liveness of processes and can
automatically handle the case where one dies (maybe by starting up another
heat, or nova-compute, or ...).

But ya, your summary is right, distributed systems are wonky just in
general. But all I can say is that zookeeper is pretty battle tested :)

On 10/30/13 6:04 PM, Clint Byrum cl...@fewbar.com wrote:

Excerpts from Joshua Harlow's message of 2013-10-30 17:46:44 -0700:
 This works as long as you have 1 DB and don't fail over to a secondary
 slave DB.
 
 Now you can say we all must use percona (or similar) for this, but then

Did you mean Galera which provides multiple synchronous masters?

 that's a change in deployment as well (and imho a bigger one). This is
 where the concept of a quorum in zookeeper comes into play: the
 transaction log that zookeeper maintains will be consistent among all
 members in that quorum. This is a typical zookeeper deployment strategy
 (selecting how many nodes you want in that quorum being an important
 question).


Galera uses more or less the exact same mechanism.

 It also doesn't handle the case where you can automatically recover from
 the current resource owner (nova-compute for example) dying.


I don't know what that means.

 Your atomic check-if-owner-is-empty-and-store-instance-as-owner is now
 user initiated instead of being automatic (zookeeper provides these kinds
 of notifications via its watch concept). So that makes it hard for, say,
 an automated system (heat?) to react to these failures in any other way
 than repeated polling (or repeated retries or periodic tasks), which means
 that heat will not be able to react to failure in a 'live' manner. So this
 to me is the liveness question that zookeeper is designed to help out
 with; of course you can simulate this in a DB with repeated polling (as
 long as you don't try to do anything complicated with mysql, like
 replicas/slaves with transaction logs that may not be caught up and that
 you might have to fail over to if problems happen, since you are on your
 own if this happens).


Right, even if you have a Galera cluster you still have to poll it or
use wonky things like triggers hooked up to memcache/gearman/amqp UDF's
to get around polling latency.

I think your point is that a weird MySQL is just as disruptive to the
normal OpenStack deployment as a weird service like ZooKeeper.

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev