[ClusterLabs] clusterlabs.org now supports https :-)

2017-06-26 Thread Ken Gaillot
Thanks to the wonderful service provided by Let's Encrypt[1], we now
have an SSL certificate for the ClusterLabs websites. You can browse
the websites over an encrypted connection by starting the URL with
"https", for example:

   https://www.clusterlabs.org/

The ClusterLabs wiki[2] and bugzilla[3] sites, which accept logins, now
always redirect to https, so passwords are never sent in clear text.
While we have no indication that any accounts have ever been
compromised, now is a good time to log in and change your password if
you have an account on either of these sites.
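
If you want to verify the redirect yourself, a quick check with curl
(assuming curl is installed) looks like this:

```
# Fetch only the headers of the plain-http URL; a Location: header
# pointing at https:// confirms the redirect described above.
curl -sI http://wiki.clusterlabs.org/ | grep -i '^location:'
```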

[1] https://letsencrypt.org/
[2] https://wiki.clusterlabs.org/
[3] https://bugs.clusterlabs.org/
-- 
Ken Gaillot 

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] ocf_take_lock is NOT actually safe to use

2017-06-26 Thread Dejan Muhamedagic
Hi,

On Wed, Jun 21, 2017 at 04:40:47PM +0200, Lars Ellenberg wrote:
> 
> Repost to a wider audience, to raise awareness for this.
> ocf_take_lock may or may not be better than nothing.
> 
> It at least "annotates" that the author would like to protect something
> that is considered a "critical region" of the resource agent.
> 
> At the same time, it does NOT deliver what the name seems to imply.
> 

Lars, many thanks for the analysis and bringing this up again.

I'm not going to go into the details below; I'll just say that there's
now a pull request for the issue:

https://github.com/ClusterLabs/resource-agents/pull/995

In short, it consists of reducing the race window (by using mkdir*),
a double test for stale locks, and an improved random-number function.
I ran numerous tests with and without stale locks and it seems to hold
up quite well.

The comments there contain a detailed description of the
approach.
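
For anyone who doesn't want to open the pull request right away, here is
a rough sketch of the mkdir idea with the stale-lock double test
(illustrative only, not the actual code from the PR; the path and
function names are made up):

```
#!/bin/sh
# Illustrative sketch only -- not the resource-agents code.
# mkdir is atomic: exactly one caller can create the directory, so two
# callers can never both believe they "created" the lock.
lockdir=/tmp/my-ra-example.lock

take_lock() {
    until mkdir "$lockdir" 2>/dev/null; do
        holder=$(cat "$lockdir/pid" 2>/dev/null)
        if [ -n "$holder" ] && ! kill -0 "$holder" 2>/dev/null; then
            # Looks stale; re-check once before removing it ("double
            # test"), so a lock that is still being set up is not lost.
            sleep 1
            [ "$(cat "$lockdir/pid" 2>/dev/null)" = "$holder" ] &&
                rm -rf "$lockdir"
        else
            # Back off for a short, randomized interval before retrying.
            sleep $(( $(od -An -N1 -tu1 /dev/urandom) % 3 + 1 ))
        fi
    done
    echo $$ > "$lockdir/pid"
}

release_lock() {
    rm -rf "$lockdir"
}
```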

Please review and comment, whoever finds the time.

Cheers,

Dejan

*) Though the current implementation uses just a file and the proposed
one uses directories, the locks are short-lived and there shouldn't be
any problems on upgrades.

> I think I brought this up a few times over the years, but was not noisy
> enough about it, because it seemed not important enough: no-one was
> actually using this anyways.
> 
> But since new usage has been recently added with
> [ClusterLabs/resource-agents] targetcli lockfile (#917)
> here goes:
> 
> On Wed, Jun 07, 2017 at 02:49:41PM -0700, Dejan Muhamedagic wrote:
> > On Wed, Jun 07, 2017 at 05:52:33AM -0700, Lars Ellenberg wrote:
> > > Note: ocf_take_lock is NOT actually safe to use.
> > > 
> > > As implemented, it uses "echo $pid > lockfile" to create the lockfile,
> > > which means if several such "ocf_take_lock" happen at the same time,
> > > they all "succeed", only the last one will be the "visible" one to future 
> > > waiters.
> > 
> > Ugh.
> 
> Exactly.
> 
> Reproducer:
> #
> #!/bin/bash
> export OCF_ROOT=/usr/lib/ocf/ ;
> .  /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs ;
> 
> x() (
>   ocf_take_lock dummy-lock ;
>   ocf_release_lock_on_exit dummy-lock  ;
>   set -C;
>   echo x > protected && sleep 0.15 && rm -f protected || touch BROKEN;
> );
> 
> mkdir -p /run/ocf_take_lock_demo
> cd /run/ocf_take_lock_demo
> rm -f BROKEN; i=0;
> time while ! test -e BROKEN; do
>   x &  x &
>   wait;
>   i=$(( i+1 ));
> done ;
> test -e BROKEN && echo "reproduced race in $i iterations"
> #
> 
> x() above takes the "dummy-lock" and, thanks to the () subshell and
> ocf_release_lock_on_exit, releases it again on exit; within the
> protected region of code it creates and then removes a file named
> "protected".
> 
> If ocf_take_lock were correct, there could never be two instances
> inside the lock at once, so "echo x > protected" should never fail.
> 
> With the current implementation of ocf_take_lock, it takes "just a
> few" iterations here to reproduce the race (usually within a minute).
> 
> The races I see in ocf_take_lock:
> "creation race":
>   test -e $lock
>   # someone else may create it here
>   echo $$ > $lock
>   # but we override it with ours anyways
> 
> "still empty race":
>   test -e $lock    # maybe it already exists (open O_CREAT|O_TRUNC)
>                    # but does not yet contain target pid,
>   pid=`cat $lock`  # this one is empty,
>   kill -0 $pid     # and this one fails
>   and thus a "just being created" one is considered stale
> 
> There are other problems around "stale pid file detection",
> but let's not go into that minefield right now.
> 
> > > Maybe we should change it to
> > > ```
> > > while ! ( set -C; echo $pid > lockfile ); do
> > >     if test -e lockfile ; then
> > >         : error handling for existing lockfile, stale lockfile detection
> > >     else
> > >         : error handling for not being able to create lockfile
> > >     fi
> > > done
> > > : only reached if lockfile was successfully created
> > > ```
> > > 
> > > (or use flock or other tools designed for that purpose)
> > 
> > flock would probably be the easiest. mkdir would do too, but for
> > upgrade issues.
> 
> and, being part of util-linux, flock should be available "everywhere".
> 
> but because writing "wrappers" around flock with the intended
> semantics of ocf_take_lock and ocf_release_lock_on_exit is not easy
> either, you'd usually be better off using flock directly in the RA.
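
For illustration, using flock directly in an RA could look roughly like
this (just a sketch, assuming util-linux flock is installed; the lock
file path and timeout are made up):

```
#!/bin/sh
# Sketch: serialize a critical region with flock(1) on a dedicated fd.
# The lock is tied to the open file descriptor, so it is released
# automatically when the subshell exits, even on errors.
LOCKFILE=/tmp/myagent-example.lock

(
    flock -x -w 10 9 || exit 1   # exclusive lock, give up after 10 seconds
    # ... critical region: whatever the RA needs to serialize ...
) 9>"$LOCKFILE"
```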
> 
> so, still trying to do this with shell:
> 
> "set -C" (or equivalently "set -o noclobber"):
>   If set, disallow existing regular files to be overwritten
>   by redirection of output.
> 
> normal '>' means: O_WRONLY|O_CREAT|O_TRUNC,
> set -C '>' means: O_WRONLY|O_CREAT|O_EXCL
> 
> using "set -C ; echo $$ > $lock" instead of 
> "test -e $lock || echo $$ > $lock"
> gets rid of the "creation race".
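
A quick way to see the noclobber difference from an interactive shell
(the file name is made up; the exact error text depends on the shell):

```
set -C                      # same as: set -o noclobber
echo $$ > /tmp/demo.lock    # first writer: the file is created (O_EXCL)
echo $$ > /tmp/demo.lock    # second writer: the redirection now fails
                            # (bash reports "cannot overwrite existing file")
rm -f /tmp/demo.lock; set +C
```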
> 
> To get rid of the "still empty 

Re: [ClusterLabs] vip is not removed after node lost connection with the other two nodes

2017-06-26 Thread Jan Friesse

Jan Pokorný wrote:
> [Hui, no need to address us individually along with the list, we are
> both subscribed to it since around the beginning]
>
> On 26/06/17 16:10 +0800, Hui Xiang wrote:
>> Thanks guys!!
>>
>> @Ken
>> I did "ifconfig ethx down" to make the cluster interface down.
>
> That's what I suspected and what I tried to show as problematic to say
> the least, based on the previous dismay.
>
>> @Jan
>>
>> Do you know what is the "known bug" mentioned below:
>>
>> "During ifdown corosync will rebind to localhost (this is long time
>> known bug) and behaves weirdly."
>>
>> http://lists.clusterlabs.org/pipermail/users/2015-July/000878.html
>
> There wasn't much investigation on your side, was there?
>
> https://github.com/corosync/corosync/wiki/Corosync-and-ifdown-on-active-network-interface
>
> Honza Friesse or Chrissie can comment more on this topic.


There is really nothing to comment. We are working on fix for Corosync 
3.x. Also even after fixing this bug it will be bad idea to test cluster 
recovery just by using ifdown.
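
One commonly suggested alternative (not from this thread) is to block
the cluster traffic with a firewall rule instead of downing the
interface, so the address stays configured and corosync does not rebind
to localhost. A rough sketch, assuming corosync's default UDP port 5405
(check your configuration):

```
# Illustrative only: drop corosync traffic instead of taking the
# interface down, then watch how the cluster reacts.
iptables -A INPUT  -p udp --dport 5405 -j DROP
iptables -A OUTPUT -p udp --dport 5405 -j DROP

# ... observe fencing / recovery ...

# Clean up afterwards:
iptables -D INPUT  -p udp --dport 5405 -j DROP
iptables -D OUTPUT -p udp --dport 5405 -j DROP
```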

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] vip is not removed after node lost connection with the other two nodes

2017-06-26 Thread Jan Pokorný
[Hui, no need to address us individually along with the list, we are
both subscribed to it since around the beginning]

On 26/06/17 16:10 +0800, Hui Xiang wrote:
> Thanks guys!!
> 
> @Ken
> I did "ifconfig ethx down" to make the cluster interface down.

That's what I suspected and what I tried to show as problematic to say
the least, based on the previous dismay.

> @Jan
> 
> Do you know what is the "known bug" mentioned below:
> 
> "During ifdown corosync will rebind to localhost (this is long time
> known bug) and behaves weirdly."
> 
> http://lists.clusterlabs.org/pipermail/users/2015-July/000878.html

There wasn't much investigation on your side, was there?

https://github.com/corosync/corosync/wiki/Corosync-and-ifdown-on-active-network-interface

Honza Friesse or Chrissie can comment more on this topic.

-- 
Jan (Poki)


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org