[ 
https://issues.apache.org/jira/browse/WHIRR-329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051544#comment-13051544
 ] 

Tibor Kiss commented on WHIRR-329:
----------------------------------

Regardless of which kind of resource is used to implement such a locking 
mechanism, the problem with it is that if a cluster disappears without 
cleaning up its lock, any subsequent normal cluster startup is affected.
Remember that before this process was made idempotent, the Amazon security 
group object associated with the cluster effectively played the role of a 
locking mechanism, but it had to be made idempotent precisely because of the 
problem mentioned above.

Since I was the one who mentioned how useful such a locking mechanism would 
be, I should share my experience with it.
First, consider the following categorization of use cases. In the first, 
several users may concurrently attempt to start the same cluster setup from 
different locations; only something like a two-phase-commit-based locking 
mechanism would help there. In the second, the cluster is started from a 
single endpoint; in that case, avoiding a duplicate startup is an 
application-coordination problem, and such logic is much easier to implement 
without any distributed locking in Whirr. I think the second category is the 
realistic one; the first case simply does not exist if cluster names are used 
in an organized way.

Another consideration is that a locking mechanism usually blocks a process 
(in our case, the cluster startup). Instead of blocking the startup, a second 
attempt could simply contact the already started cluster. Of course, there 
are some details to sort out regarding a previous unsuccessful attempt, etc.

At the beginning of the process, a checkpoint marker has to be placed in a 
common location (in the blobstore cache or, if that is not used, on the local 
drive).
The checkpoint marker records the time of the cluster startup attempt and the 
timeout value of the startup process. The timeout is important so that a 
second attempt is kept out only for a sufficient time and no longer. If an 
attempt observes such a checkpoint, it automatically gives up starting a new 
cluster and instead reads up the Instance objects to contact the existing 
cluster. If the cluster state information is not yet saved and the checkpoint 
marker has not expired, it notifies the caller that a cluster startup is in 
progress by somebody else, telling them the next possible time to retry. If 
the cluster state information is missing and the checkpoint marker has 
expired, the client in question proceeds with a new cluster startup.
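The decision table above could be sketched roughly as follows. This is only an 
illustration of the logic, not Whirr code; the class and method names 
(StartupCheckpoint, decide) are hypothetical, and the marker's storage in the 
blobstore is abstracted away.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the checkpoint-marker decision logic described above.
public class StartupCheckpoint {

    private final long attemptTimeMillis; // when the startup attempt began
    private final long timeoutMillis;     // how long the attempt may hold the marker

    public StartupCheckpoint(long attemptTimeMillis, long timeoutMillis) {
        this.attemptTimeMillis = attemptTimeMillis;
        this.timeoutMillis = timeoutMillis;
    }

    public boolean isExpired(long nowMillis) {
        return nowMillis - attemptTimeMillis > timeoutMillis;
    }

    /**
     * What should a startup attempt do, given the marker (null if absent)
     * and whether the cluster state information has already been saved?
     */
    public static String decide(StartupCheckpoint marker,
                                boolean clusterStateSaved,
                                long nowMillis) {
        if (marker == null) {
            return "START";       // no attempt in progress: start the cluster
        }
        if (clusterStateSaved) {
            return "CONTACT";     // cluster is up: read Instances and contact it
        }
        if (!marker.isExpired(nowMillis)) {
            return "RETRY_LATER"; // somebody else is starting it: tell caller when to retry
        }
        return "START";           // stale marker: continue with a new startup
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        StartupCheckpoint fresh = new StartupCheckpoint(now, TimeUnit.MINUTES.toMillis(15));
        StartupCheckpoint stale = new StartupCheckpoint(now - TimeUnit.HOURS.toMillis(1),
                                                        TimeUnit.MINUTES.toMillis(15));
        System.out.println(decide(null, false, now));   // START
        System.out.println(decide(fresh, false, now));  // RETRY_LATER
        System.out.println(decide(fresh, true, now));   // CONTACT
        System.out.println(decide(stale, false, now));  // START
    }
}
```

Note that there is no atomic check-and-write here; as argued below, that is 
acceptable because we are not trying to handle true concurrent races.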
Also, at cluster shutdown, the checkpoint marker has to be cleared.
Note that we are not handling race conditions here (as I explained earlier, 
we do not need to); we only handle incidental but successive attempts, which 
do not really cause races between reading the checkpoint and writing it. This 
simplifies the locking mechanism to mere logic for handling an already 
running startup process and its possible outcomes.

To sum it up, I don't think a pure distributed locking mechanism is 
necessary, only a small change to make it possible to contact an already 
started cluster. What is your opinion?

> Ensure the same cluster can't be started twice
> ----------------------------------------------
>
>                 Key: WHIRR-329
>                 URL: https://issues.apache.org/jira/browse/WHIRR-329
>             Project: Whirr
>          Issue Type: Improvement
>            Reporter: Andrei Savu
>            Assignee: Andrei Savu
>
> We should add some sort of distributed locking mechanism so that we can be 
> sure that the same cluster can't be started twice. Tibor was the first one to 
> raise this issue. 
> Could a blobstore be used for locking? 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
