[
https://issues.apache.org/jira/browse/WHIRR-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051544#comment-13051544
]
Tibor Kiss commented on WHIRR-329:
----------------------------------
Regardless of which kind of resource is used to implement such a locking
mechanism, the problem with it is that if the cluster disappears without
cleaning up the lock, it blocks any subsequent normal cluster startup.
Remember that before this process was made idempotent, the Amazon security
group object associated with the cluster played the role of a locking
mechanism. But it had to be made idempotent precisely because of the cause
mentioned above.
Since I was the one who pointed out how useful such a locking mechanism would
be, I should share my experience with it.
First consider the following categorization of possible use cases. In the
first, several users may concurrently attempt to start the same cluster setup
from different locations; only a locking mechanism based on something like a
two-phase commit would help there. In the second, a cluster is started from a
single endpoint; in that case avoiding a duplicate start is an application
coordination problem, and it is much easier to implement such logic there
without adding distributed locking to Whirr. I think the second category is
the realistic one; the first case simply does not exist if cluster names are
used in an organized way.
Another consideration is that a locking mechanism usually prohibits a process
(in our case the cluster startup). Instead of prohibiting the startup, a
second attempt could simply contact the already started cluster. Of course
there are some details to work out regarding a previous unsuccessful attempt,
etc.
At the beginning of the process, a checkpoint marker has to be placed in a
common place (in the blobstore cache or, if that is not used, on the local
drive). The checkpoint marker records the time of the cluster startup attempt
and the timeout value of the startup process. The timeout is important so that
a second attempt is kept out only for a sufficient time and no longer. If an
attempt observes such a checkpoint, it automatically gives up starting a new
cluster and instead reads the Instance objects in order to contact the
existing cluster. If the cluster state information is not yet saved and the
checkpoint marker has not expired, it notifies the caller that somebody else
is starting the cluster, telling them the next possible time to retry. If the
cluster state information is missing and the checkpoint marker has also
expired, the client in question proceeds with a fresh cluster startup.
Also, at cluster shutdown, the checkpoint marker has to be cleared.
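The decision logic above could be sketched roughly as follows. This is only an
illustration, not Whirr code: the names CheckpointMarker, StartupDecision and
decide are hypothetical, and the marker is shown as an in-memory value rather
than a blob in the blobstore cache.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

public class ClusterStartupCheckpoint {

    /** Marker written at the start of a cluster launch attempt. */
    record CheckpointMarker(Instant attemptTime, Duration timeout) {
        boolean expired(Instant now) {
            return now.isAfter(attemptTime.plus(timeout));
        }
    }

    enum StartupDecision { CONTACT_RUNNING_CLUSTER, RETRY_LATER, START_NEW_CLUSTER }

    /** Decide what a launch attempt should do, given the marker (if any)
        and whether the cluster state information has already been saved. */
    static StartupDecision decide(Optional<CheckpointMarker> marker,
                                  boolean clusterStateSaved,
                                  Instant now) {
        if (clusterStateSaved) {
            // Cluster is already up: read the Instance objects and contact it.
            return StartupDecision.CONTACT_RUNNING_CLUSTER;
        }
        if (marker.isPresent() && !marker.get().expired(now)) {
            // Somebody else is still starting the cluster; the caller can
            // retry after attemptTime + timeout.
            return StartupDecision.RETRY_LATER;
        }
        // No saved state and no live marker: proceed with a fresh startup.
        return StartupDecision.START_NEW_CLUSTER;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        CheckpointMarker live = new CheckpointMarker(now.minusSeconds(60), Duration.ofMinutes(10));
        CheckpointMarker stale = new CheckpointMarker(now.minusSeconds(3600), Duration.ofMinutes(10));

        System.out.println(decide(Optional.of(live), false, now));   // RETRY_LATER
        System.out.println(decide(Optional.of(stale), false, now));  // START_NEW_CLUSTER
        System.out.println(decide(Optional.of(live), true, now));    // CONTACT_RUNNING_CLUSTER
    }
}
```

Note that the sketch assumes reading and writing the marker do not race, which
matches the single-endpoint use case argued for above.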
Note that we are not handling race conditions here (as I explained earlier, we
do not need to); we only handle incidental but successive attempts, which do
not really cause races between reading the checkpoint and writing it. The
locking mechanism has therefore been simplified to logic for handling an
already running startup process and the differences between the cases.
To sum up, I don't think a full distributed locking mechanism is necessary,
only a small change so that a second attempt can contact an already started
cluster. What is your opinion?
> Ensure the same cluster can't be started twice
> ----------------------------------------------
>
> Key: WHIRR-329
> URL: https://issues.apache.org/jira/browse/WHIRR-329
> Project: Whirr
> Issue Type: Improvement
> Reporter: Andrei Savu
> Assignee: Andrei Savu
>
> We should add some sort of distributed locking mechanism so that we can be
> sure that the same cluster can't be started twice. Tibor was the first one to
> raise this issue.
> Could a blobstore be used for locking?
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira