Re: [openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

2015-12-10 Thread Joshua Harlow

Unsure what you mean here,

Tooz is already in oslo.

Were you thinking of something else?

D'Angelo, Scott wrote:

Could the work for the tooz variant be leveraged to add a truly distributed 
solution (with the proper tooz distributed backend)? If so, then +1 to this 
idea. Cinder will be implementing a version of tooz-based distributed locks, 
so having it in Oslo someday is a goal, I'd think.


From: Joshua Harlow [harlo...@fastmail.com]
Sent: Wednesday, December 09, 2015 6:13 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [oslo][all] The lock files saga (and where we can 
go from here)

So,

To try to reach some kind of conclusion here I am wondering if it would
be acceptable to folks (would people even adopt such a change?) if we
(oslo folks/others) provided a new function in say lockutils.py (in
oslo.concurrency) that would let users of oslo.concurrency pick which
kind of lock they would want to use...

The two types would be:

1. A pid-based lock, which would *not* be resistant to crashing
processes; it would perhaps use
https://github.com/openstack/pylockfile/blob/master/lockfile/pidlockfile.py
internally. It would be more easily breakable and more easily
introspect-able (by either deleting the file or `cat`-ing the file to see
the pid inside of it).
2. The existing lock that is resistant to crashing processes (it
automatically releases on owner process crash) but is not easily
introspect-able (to know who is using the lock) and is not easily
breakable (i.e. to forcefully break the lock and release the waiters and
the current lock holder).

Would people use these two variants if (oslo) provided them, or would
the status quo exist and nothing much would change?

A third possibility is to spend energy using/integrating tooz
distributed locks and treating different processes on the same system as
distributed instances (even though they really are not distributed in
the classical sense). The locks that tooz supports are already
introspect-able (via various means) and can be broken if needed (work is
in progress to make this breaking process more usable via an API).

Thoughts?

-Josh

Clint Byrum wrote:

Excerpts from Joshua Harlow's message of 2015-12-01 09:28:18 -0800:

Sean Dague wrote:

On 12/01/2015 08:08 AM, Duncan Thomas wrote:

On 1 December 2015 at 13:40, Sean Dague <s...@dague.net> wrote:


   The current approach means locks block on their own, are processed in
   the order they come in, but deletes aren't possible. The busy lock would
   mean deletes were normal. Some extra cpu is spent on waiting, and lock
   order processing would be non-deterministic. It's a set of trade-offs, but I
   don't know of anywhere that we are using locks as queues, so order
   shouldn't matter. The cpu cost of the busy wait, versus lock file
   cleanliness, might be a trade worth making. It would also let you actually
   see what's locked from the outside pretty easily.


The cinder locks are very much used as queues in places, e.g. making
delete wait until after an image operation finishes. Given that cinder
can already bring a node into resource issues while doing lots of image
operations concurrently (such as creating lots of bootable volumes at
once) I'd be resistant to anything that makes it worse to solve a
cosmetic issue.

Is that really a queue? "Don't do X while Y" is a lock. "Do X, Y, Z, in
order after W is done" is a queue. And what you've explained above,
"Don't DELETE while DOING OTHER ACTION", is really just the queue model.

What I mean by treating locks as queues was depending on X, Y, Z
happening in that order after W. With a busy wait approach they might
happen as Y, Z, X or X, Z, B, Y. They will all happen after W is done.
But relative to each other, or to new ops coming in, no real order is
enforced.


So ummm, just so people know, the fasteners lock code (and the stuff that
has existed for file locks in oslo.concurrency and prior to that
oslo-incubator...) has never guaranteed the above sequencing.

How it works (and has always worked) is the following:

1. A lock object is created
(https://github.com/harlowja/fasteners/blob/master/fasteners/process_lock.py#L85)
2. That lock object acquire is performed
(https://github.com/harlowja/fasteners/blob/master/fasteners/process_lock.py#L125)
3. At that point do_open is called to ensure the file exists (if it
exists already it is opened in append mode, so no overwrite happens) and
the lock object has a reference to the file descriptor of that file
(https://github.com/harlowja/fasteners/blob/master/fasteners/process_lock.py#L112)
4. A retry loop starts that repeats until either a provided timeout has
elapsed or the lock is acquired; the retry logic you can skip over, but
the code that the retry loop calls is
https://github.com/harlowja/fasteners/blob/master/fasteners/process_lock.py#L92

The retry loop (really this 

Re: [openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

2015-12-10 Thread D'Angelo, Scott
Could the work for the tooz variant be leveraged to add a truly distributed 
solution (with the proper tooz distributed backend)? If so, then +1 to this 
idea. Cinder will be implementing a version of tooz-based distributed locks, 
so having it in Oslo someday is a goal, I'd think.


From: Joshua Harlow [harlo...@fastmail.com]
Sent: Wednesday, December 09, 2015 6:13 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [oslo][all] The lock files saga (and where we can 
go from here)

So,

To try to reach some kind of conclusion here I am wondering if it would
be acceptable to folks (would people even adopt such a change?) if we
(oslo folks/others) provided a new function in say lockutils.py (in
oslo.concurrency) that would let users of oslo.concurrency pick which
kind of lock they would want to use...

The two types would be:

1. A pid-based lock, which would *not* be resistant to crashing
processes; it would perhaps use
https://github.com/openstack/pylockfile/blob/master/lockfile/pidlockfile.py
internally. It would be more easily breakable and more easily
introspect-able (by either deleting the file or `cat`-ing the file to see
the pid inside of it).
2. The existing lock that is resistant to crashing processes (it
automatically releases on owner process crash) but is not easily
introspect-able (to know who is using the lock) and is not easily
breakable (i.e. to forcefully break the lock and release the waiters and
the current lock holder).

Would people use these two variants if (oslo) provided them, or would
the status quo exist and nothing much would change?

A third possibility is to spend energy using/integrating tooz
distributed locks and treating different processes on the same system as
distributed instances (even though they really are not distributed in
the classical sense). The locks that tooz supports are already
introspect-able (via various means) and can be broken if needed (work is
in progress to make this breaking process more usable via an API).

Thoughts?

-Josh

Clint Byrum wrote:
> Excerpts from Joshua Harlow's message of 2015-12-01 09:28:18 -0800:
>> Sean Dague wrote:
>>> On 12/01/2015 08:08 AM, Duncan Thomas wrote:
>>>> On 1 December 2015 at 13:40, Sean Dague <s...@dague.net> wrote:
>>>>
>>>>
>>>>   The current approach means locks block on their own, are processed in
>>>>   the order they come in, but deletes aren't possible. The busy lock would
>>>>   mean deletes were normal. Some extra cpu is spent on waiting, and lock
>>>>   order processing would be non-deterministic. It's a set of trade-offs, but I
>>>>   don't know of anywhere that we are using locks as queues, so order
>>>>   shouldn't matter. The cpu cost of the busy wait, versus lock file
>>>>   cleanliness, might be a trade worth making. It would also let you actually
>>>>   see what's locked from the outside pretty easily.
>>>>
>>>>
>>>> The cinder locks are very much used as queues in places, e.g. making
>>>> delete wait until after an image operation finishes. Given that cinder
>>>> can already bring a node into resource issues while doing lots of image
>>>> operations concurrently (such as creating lots of bootable volumes at
>>>> once) I'd be resistant to anything that makes it worse to solve a
>>>> cosmetic issue.
>>> Is that really a queue? "Don't do X while Y" is a lock. "Do X, Y, Z, in
>>> order after W is done" is a queue. And what you've explained above,
>>> "Don't DELETE while DOING OTHER ACTION", is really just the queue model.
>>>
>>> What I mean by treating locks as queues was depending on X, Y, Z
>>> happening in that order after W. With a busy wait approach they might
>>> happen as Y, Z, X or X, Z, B, Y. They will all happen after W is done.
>>> But relative to each other, or to new ops coming in, no real order is
>>> enforced.
>>>
>> So ummm, just so people know, the fasteners lock code (and the stuff that
>> has existed for file locks in oslo.concurrency and prior to that
>> oslo-incubator...) has never guaranteed the above sequencing.
>>
>> How it works (and has always worked) is the following:
>>
>> 1. A lock object is created
>> (https://github.com/harlowja/fasteners/blob/master/fasteners/process_lock.py#L85)
>> 2. That lock object acquire is performed
>> (https://github.com/harlowja/fasteners/blob/master/fasteners/process_lock.py#L125)
>> 3. At that point do_open is called to ensure the file exists (if it
>> exists already it is opened in appe

Re: [openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

2015-12-09 Thread Joshua Harlow

So,

To try to reach some kind of conclusion here I am wondering if it would 
be acceptable to folks (would people even adopt such a change?) if we 
(oslo folks/others) provided a new function in say lockutils.py (in 
oslo.concurrency) that would let users of oslo.concurrency pick which 
kind of lock they would want to use...


The two types would be:

1. A pid-based lock, which would *not* be resistant to crashing
processes; it would perhaps use
https://github.com/openstack/pylockfile/blob/master/lockfile/pidlockfile.py
internally. It would be more easily breakable and more easily
introspect-able (by either deleting the file or `cat`-ing the file to see
the pid inside of it).
2. The existing lock that is resistant to crashing processes (it
automatically releases on owner process crash) but is not easily
introspect-able (to know who is using the lock) and is not easily
breakable (i.e. to forcefully break the lock and release the waiters and
the current lock holder).
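
For illustration only, here is a rough sketch of how the two variants
might look to a caller (assuming the pylockfile and fasteners APIs
linked above; the paths are invented for the example):

from lockfile.pidlockfile import PIDLockFile
import fasteners

# Variant 1: pid-based lock; introspect-able (the file contains the
# owner's pid) and breakable (delete the file, or call break_lock()),
# but *not* auto-released if the owning process crashes.
with PIDLockFile('/var/run/myapp/resource-x.pid'):
    pass  # manipulate resource x
# `cat /var/run/myapp/resource-x.pid` shows the holder's pid.

# Variant 2: the existing fcntl-based lock; auto-released on owner
# crash, but the (zero byte) file says nothing about who holds it.
with fasteners.InterProcessLock('/var/lock/myapp/resource-x'):
    pass  # manipulate resource x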


Would people use these two variants if (oslo) provided them, or would 
the status quo exist and nothing much would change?


A third possibility is to spend energy using/integrating tooz
distributed locks and treating different processes on the same system as
distributed instances (even though they really are not distributed in
the classical sense). The locks that tooz supports are already
introspect-able (via various means) and can be broken if needed (work is
in progress to make this breaking process more usable via an API).
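
As a rough sketch of what that could look like with tooz (using its file
driver purely as a stand-in backend; the URL and member id here are
invented for the example):

from tooz import coordination

coordinator = coordination.get_coordinator('file:///tmp/locks',
                                           b'member-1')
coordinator.start()
# tooz locks are introspect-able/breakable through the backend itself.
with coordinator.get_lock(b'resource-x'):
    pass  # manipulate resource x
coordinator.stop()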


Thoughts?

-Josh

Clint Byrum wrote:

Excerpts from Joshua Harlow's message of 2015-12-01 09:28:18 -0800:

Sean Dague wrote:

On 12/01/2015 08:08 AM, Duncan Thomas wrote:

On 1 December 2015 at 13:40, Sean Dague <s...@dague.net> wrote:


  The current approach means locks block on their own, are processed in
  the order they come in, but deletes aren't possible. The busy lock would
  mean deletes were normal. Some extra cpu is spent on waiting, and lock
  order processing would be non-deterministic. It's a set of trade-offs, but I
  don't know of anywhere that we are using locks as queues, so order
  shouldn't matter. The cpu cost of the busy wait, versus lock file
  cleanliness, might be a trade worth making. It would also let you actually
  see what's locked from the outside pretty easily.


The cinder locks are very much used as queues in places, e.g. making
delete wait until after an image operation finishes. Given that cinder
can already bring a node into resource issues while doing lots of image
operations concurrently (such as creating lots of bootable volumes at
once) I'd be resistant to anything that makes it worse to solve a
cosmetic issue.

Is that really a queue? "Don't do X while Y" is a lock. "Do X, Y, Z, in
order after W is done" is a queue. And what you've explained above,
"Don't DELETE while DOING OTHER ACTION", is really just the queue model.

What I mean by treating locks as queues was depending on X, Y, Z
happening in that order after W. With a busy wait approach they might
happen as Y, Z, X or X, Z, B, Y. They will all happen after W is done.
But relative to each other, or to new ops coming in, no real order is
enforced.


So ummm, just so people know, the fasteners lock code (and the stuff that
has existed for file locks in oslo.concurrency and prior to that
oslo-incubator...) has never guaranteed the above sequencing.

How it works (and has always worked) is the following:

1. A lock object is created
(https://github.com/harlowja/fasteners/blob/master/fasteners/process_lock.py#L85)
2. That lock object acquire is performed
(https://github.com/harlowja/fasteners/blob/master/fasteners/process_lock.py#L125)
3. At that point do_open is called to ensure the file exists (if it
exists already it is opened in append mode, so no overwrite happens) and
the lock object has a reference to the file descriptor of that file
(https://github.com/harlowja/fasteners/blob/master/fasteners/process_lock.py#L112)
4. A retry loop starts that repeats until either a provided timeout has
elapsed or the lock is acquired; the retry logic you can skip over, but
the code that the retry loop calls is
https://github.com/harlowja/fasteners/blob/master/fasteners/process_lock.py#L92

The retry loop (really this loop @
https://github.com/harlowja/fasteners/blob/master/fasteners/_utils.py#L87)
will idle for a given delay between attempts to lock the file, so there
is no queue-like sequencing. If, for example, entity A (who created its
lock object at t0) sleeps for 50 seconds between attempts and entity B
(who created its lock object at t1) sleeps for 5 seconds between
attempts, the outcome is biased toward entity B acquiring the lock
(since entity B retries more often).

So just fyi, I wouldn't be depending on these for queuing/ordering as is...



Agreed, this form of fcntl locking is basically equivalent to
O_CREAT|O_EXCL locks as Sean described, since we never use the blocking
form. I'm 

Re: [openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

2015-12-01 Thread Sean Dague
On 12/01/2015 08:08 AM, Duncan Thomas wrote:
> 
> 
> On 1 December 2015 at 13:40, Sean Dague <s...@dague.net> wrote:
> 
> 
> The current approach means locks block on their own, are processed in
> the order they come in, but deletes aren't possible. The busy lock would
> mean deletes were normal. Some extra cpu is spent on waiting, and lock
> order processing would be non-deterministic. It's a set of trade-offs, but I
> don't know of anywhere that we are using locks as queues, so order
> shouldn't matter. The cpu cost of the busy wait, versus lock file
> cleanliness, might be a trade worth making. It would also let you actually
> see what's locked from the outside pretty easily.
> 
> 
> The cinder locks are very much used as queues in places, e.g. making
> delete wait until after an image operation finishes. Given that cinder
> can already bring a node into resource issues while doing lots of image
> operations concurrently (such as creating lots of bootable volumes at
> once) I'd be resistant to anything that makes it worse to solve a
> cosmetic issue.

Is that really a queue? "Don't do X while Y" is a lock. "Do X, Y, Z, in
order after W is done" is a queue. And what you've explained above,
"Don't DELETE while DOING OTHER ACTION", is really just the queue model.

What I mean by treating locks as queues was depending on X, Y, Z
happening in that order after W. With a busy wait approach they might
happen as Y, Z, X or X, Z, B, Y. They will all happen after W is done.
But relative to each other, or to new ops coming in, no real order is
enforced.

-Sean

-- 
Sean Dague
http://dague.net



Re: [openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

2015-12-01 Thread Joshua Harlow

Sean Dague wrote:

On 12/01/2015 08:08 AM, Duncan Thomas wrote:


On 1 December 2015 at 13:40, Sean Dague <s...@dague.net> wrote:


 The current approach means locks block on their own, are processed in
 the order they come in, but deletes aren't possible. The busy lock would
 mean deletes were normal. Some extra cpu is spent on waiting, and lock
 order processing would be non-deterministic. It's a set of trade-offs, but I
 don't know of anywhere that we are using locks as queues, so order
 shouldn't matter. The cpu cost of the busy wait, versus lock file
 cleanliness, might be a trade worth making. It would also let you actually
 see what's locked from the outside pretty easily.


The cinder locks are very much used as queues in places, e.g. making
delete wait until after an image operation finishes. Given that cinder
can already bring a node into resource issues while doing lots of image
operations concurrently (such as creating lots of bootable volumes at
once) I'd be resistant to anything that makes it worse to solve a
cosmetic issue.


Is that really a queue? "Don't do X while Y" is a lock. "Do X, Y, Z, in
order after W is done" is a queue. And what you've explained above,
"Don't DELETE while DOING OTHER ACTION", is really just the queue model.

What I mean by treating locks as queues was depending on X, Y, Z
happening in that order after W. With a busy wait approach they might
happen as Y, Z, X or X, Z, B, Y. They will all happen after W is done.
But relative to each other, or to new ops coming in, no real order is
enforced.



So ummm, just so people know, the fasteners lock code (and the stuff that
has existed for file locks in oslo.concurrency and prior to that
oslo-incubator...) has never guaranteed the above sequencing.


How it works (and has always worked) is the following:

1. A lock object is created 
(https://github.com/harlowja/fasteners/blob/master/fasteners/process_lock.py#L85)
2. That lock object acquire is performed 
(https://github.com/harlowja/fasteners/blob/master/fasteners/process_lock.py#L125)
3. At that point do_open is called to ensure the file exists (if it
exists already it is opened in append mode, so no overwrite happens) and
the lock object has a reference to the file descriptor of that file
(https://github.com/harlowja/fasteners/blob/master/fasteners/process_lock.py#L112)
4. A retry loop starts that repeats until either a provided timeout has
elapsed or the lock is acquired; the retry logic you can skip over, but
the code that the retry loop calls is
https://github.com/harlowja/fasteners/blob/master/fasteners/process_lock.py#L92


The retry loop (really this loop @
https://github.com/harlowja/fasteners/blob/master/fasteners/_utils.py#L87)
will idle for a given delay between attempts to lock the file, so there
is no queue-like sequencing. If, for example, entity A (who created its
lock object at t0) sleeps for 50 seconds between attempts and entity B
(who created its lock object at t1) sleeps for 5 seconds between
attempts, the outcome is biased toward entity B acquiring the lock
(since entity B retries more often).
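
As a concrete (hypothetical) illustration of that bias, using the
fasteners API linked above (the path and delays are invented for the
example):

import fasteners

def worker(poll_delay):
    lock = fasteners.InterProcessLock('/var/lock/myapp/resource-x')
    # delay/max_delay control how often this waiter re-polls the lock.
    lock.acquire(delay=poll_delay, max_delay=poll_delay)
    try:
        pass  # critical section
    finally:
        lock.release()

# Process A (started first) would call worker(50); process B (started
# later) would call worker(5). When the current holder releases,
# whichever waiter polls next wins -- B polls ten times as often, so B
# usually gets the lock despite arriving later.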


So just fyi, I wouldn't be depending on these for queuing/ordering as is...

-Josh


-Sean





Re: [openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

2015-12-01 Thread Joshua Harlow

So my takeaway is we need each project to have something like:

https://gist.github.com/harlowja/b4f0ddadbda1f92cc1e2

That could possibly exist in oslo (I just threw it together) but the 
idea is that a thread/greenthread would run that 'run_forever' method in 
that code and it would periodically try to clean off locks by acquiring 
them (with a timeout for acquire) and then deleting the lock path that 
the lock is using (and then releasing the lock).
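
Roughly, the idea in that gist is something like the following sketch
(the helper here is invented for illustration, and the caveat below
still applies to it):

import os
import threading
import time

import fasteners

def cleanup_forever(lock_paths, interval=60.0, timeout=5.0):
    # Periodically try to acquire each known lock; if we get it, nobody
    # else is using it right now, so delete its backing file.
    while True:
        for path in lock_paths:
            lock = fasteners.InterProcessLock(path)
            if lock.acquire(timeout=timeout):
                try:
                    os.unlink(path)
                finally:
                    lock.release()
        time.sleep(interval)

# Run in a background (green)thread alongside the service; the list of
# lock paths has to come from the application, since only it knows
# which locks it creates.
cleaner = threading.Thread(target=cleanup_forever,
                           args=(['/var/lock/myapp/resource-x'],))
cleaner.daemon = True
cleaner.start()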


The problems with that are as mentioned previously: even when we acquire 
a lock (aka the cleaner gets the lock) and then delete the underlying 
file, that does *not* release other entities trying to acquire that same 
lock file (especially ones that blocked themselves in their acquire() 
method before the deletion started). So that's where either we need to do 
something like Sean stated, or we need to (IMHO) get away from having a 
lock file that is deleted at all (and use byte ranges inside a single 
lock file; that single lock file would never be deleted in the first 
place), or we need to get off file locks entirely (but ya, that's like, 
umm, a bigger issue...)


Such a single lock file would then use something like the following to 
get locks from it:


class LockSharder(object):

    def __init__(self, offset_locks):
        self.offset_locks = offset_locks

    def get_lock(self, name):
        return self.offset_locks[hash(name) % len(self.offset_locks)]
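
Used, hypothetically, something like this (paths invented for the
example):

import fasteners

# Shard all named locks over a fixed pool of lock files that are
# created once and never deleted.
sharder = LockSharder([
    fasteners.InterProcessLock('/var/lock/myapp/shard-%02d' % i)
    for i in range(16)
])

with sharder.get_lock('volume-1234'):
    pass  # do work on volume-1234

Two different names can hash to the same shard, which serializes
unrelated operations but is still safe. (Note also that the hash would
need to be stable across processes; Python 3 salts str hashes per
process, so something like zlib.crc32 of the name would be needed
there.)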

So there are a few ideas...

Duncan Thomas wrote:



On 1 December 2015 at 13:40, Sean Dague <s...@dague.net> wrote:


The current approach means locks block on their own, are processed in
the order they come in, but deletes aren't possible. The busy lock would
mean deletes were normal. Some extra cpu is spent on waiting, and lock
order processing would be non-deterministic. It's a set of trade-offs, but I
don't know of anywhere that we are using locks as queues, so order
shouldn't matter. The cpu cost of the busy wait, versus lock file
cleanliness, might be a trade worth making. It would also let you actually
see what's locked from the outside pretty easily.


The cinder locks are very much used as queues in places, e.g. making
delete wait until after an image operation finishes. Given that cinder
can already bring a node into resource issues while doing lots of image
operations concurrently (such as creating lots of bootable volumes at
once) I'd be resistant to anything that makes it worse to solve a
cosmetic issue.


--
Duncan Thomas



Re: [openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

2015-12-01 Thread Clint Byrum
Excerpts from Joshua Harlow's message of 2015-12-01 09:28:18 -0800:
> Sean Dague wrote:
> > On 12/01/2015 08:08 AM, Duncan Thomas wrote:
> >>
> >> On 1 December 2015 at 13:40, Sean Dague <s...@dague.net> wrote:
> >>
> >>
> >>  The current approach means locks block on their own, are processed in
> >>  the order they come in, but deletes aren't possible. The busy lock would
> >>  mean deletes were normal. Some extra cpu is spent on waiting, and lock
> >>  order processing would be non-deterministic. It's a set of trade-offs, but I
> >>  don't know of anywhere that we are using locks as queues, so order
> >>  shouldn't matter. The cpu cost of the busy wait, versus lock file
> >>  cleanliness, might be a trade worth making. It would also let you actually
> >>  see what's locked from the outside pretty easily.
> >>
> >>
> >> The cinder locks are very much used as queues in places, e.g. making
> >> delete wait until after an image operation finishes. Given that cinder
> >> can already bring a node into resource issues while doing lots of image
> >> operations concurrently (such as creating lots of bootable volumes at
> >> once) I'd be resistant to anything that makes it worse to solve a
> >> cosmetic issue.
> >
> > Is that really a queue? "Don't do X while Y" is a lock. "Do X, Y, Z, in
> > order after W is done" is a queue. And what you've explained above,
> > "Don't DELETE while DOING OTHER ACTION", is really just the queue model.
> >
> > What I mean by treating locks as queues was depending on X, Y, Z
> > happening in that order after W. With a busy wait approach they might
> > happen as Y, Z, X or X, Z, B, Y. They will all happen after W is done.
> > But relative to each other, or to new ops coming in, no real order is
> > enforced.
> >
> 
> So ummm, just so people know, the fasteners lock code (and the stuff that 
> has existed for file locks in oslo.concurrency and prior to that 
> oslo-incubator...) has never guaranteed the above sequencing.
> 
> How it works (and has always worked) is the following:
> 
> 1. A lock object is created 
> (https://github.com/harlowja/fasteners/blob/master/fasteners/process_lock.py#L85)
> 2. That lock object acquire is performed 
> (https://github.com/harlowja/fasteners/blob/master/fasteners/process_lock.py#L125)
> 3. At that point do_open is called to ensure the file exists (if it 
> exists already it is opened in append mode, so no overwrite happens) and 
> the lock object has a reference to the file descriptor of that file 
> (https://github.com/harlowja/fasteners/blob/master/fasteners/process_lock.py#L112)
> 4. A retry loop starts that repeats until either a provided timeout has 
> elapsed or the lock is acquired; the retry logic you can skip over, but 
> the code that the retry loop calls is 
> https://github.com/harlowja/fasteners/blob/master/fasteners/process_lock.py#L92
> 
> The retry loop (really this loop @ 
> https://github.com/harlowja/fasteners/blob/master/fasteners/_utils.py#L87) 
> will idle for a given delay between attempts to lock the file, so there 
> is no queue-like sequencing. If, for example, entity A (who created its 
> lock object at t0) sleeps for 50 seconds between attempts and entity B 
> (who created its lock object at t1) sleeps for 5 seconds between 
> attempts, the outcome is biased toward entity B acquiring the lock 
> (since entity B retries more often).
> 
> So just fyi, I wouldn't be depending on these for queuing/ordering as is...
> 

Agreed, this form of fcntl locking is basically equivalent to
O_CREAT|O_EXCL locks as Sean described, since we never use the blocking
form. I'm not sure why though. The main reason one uses fcntl/flock is
to go ahead and block so waiters queue up efficiently. I'd tend to agree
with Sean that if we're going to busy wait, just using creation locks
will be simpler.
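
For reference, a creation lock in that style is just the following (a
minimal sketch; the helper name is invented):

import errno
import os

def try_creation_lock(path):
    # Atomically creating the file *is* acquiring the lock; unlinking
    # it is the release. Waiters have to busy-poll this function.
    try:
        os.close(os.open(path, os.O_CREAT | os.O_EXCL))
        return True
    except OSError as e:
        if e.errno == errno.EEXIST:
            return False
        raise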

That said, I think what is missing is the metadata for efficiently
cleaning up stale locks. That can be done with fcntl or creation locks,
but with fcntl you have the kernel telling you for sure if the locking
process is still alive when you want to clean up and take the lock. With
creation, you need to write that information into the lock, and remove
it, and then have a way to make sure the process is alive and knows it
has the lock, and that is not exactly simple. For this reason only, I
suggest staying with fcntl.

Beyond that, perhaps what is needed is a tool in oslo_concurrency or
fasteners which one can use to prune stale locks based on said metadata.
Once that exists, a cron job running that is the simplest answer. Or if
need be, let the daemons spawn processes periodically to do that (you
can't use a greenthread, since you may be cleaning up your own locks and
fcntl will gladly let a process re-lock something it already has locked).
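
A sketch of what such a pruning tool might look like (this is not an
existing oslo.concurrency/fasteners API, just an illustration of the
fcntl approach; it also assumes nothing new is about to acquire these
names while it runs, per the delete race discussed earlier):

import errno
import fcntl
import os

def prune_stale_locks(lock_dir):
    # Run from a short-lived separate process (not a greenthread),
    # since fcntl will happily re-lock this process's own live locks.
    for name in os.listdir(lock_dir):
        path = os.path.join(lock_dir, name)
        with open(path, 'a') as f:
            try:
                # Non-blocking: success means no live process holds
                # the lock, i.e. the file is stale.
                fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except (IOError, OSError) as e:
                if e.errno in (errno.EACCES, errno.EAGAIN):
                    continue  # a live owner exists; leave it alone
                raise
            os.unlink(path)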


Re: [openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

2015-12-01 Thread Sean Dague
On 11/30/2015 05:25 PM, Clint Byrum wrote:
> Excerpts from Ben Nemec's message of 2015-11-30 13:22:23 -0800:
>> On 11/30/2015 02:15 PM, Sean Dague wrote:
>>> On 11/30/2015 03:01 PM, Robert Collins wrote:
 On 1 December 2015 at 08:37, Ben Nemec  wrote:
> On 11/30/2015 12:42 PM, Joshua Harlow wrote:
>> Hi all,
>>
>> I just wanted to bring up an issue, possible solution and get feedback
>> on it from folks because it seems to be an on-going problem that shows
>> up not when an application is initially deployed but as on-going
>> operation and running of that application proceeds (ie after running for
>> a period of time).
>>
>> The gist of the problem is the following:
>>
>> A <<your favorite openstack project>> has a need to ensure that no
>> application on the same machine can manipulate a given resource on that
>> same machine, so it uses the lock file pattern (acquire a *local* lock
>> file for that resource, manipulate that resource, release that lock
>> file) to do actions on that resource in a safe manner (note this does
>> not ensure safety outside of that machine, lock files are *not*
>> distributed locks).
>>
>> The api that we expose from oslo is typically accessed via the following:
>>
>>oslo_concurrency.lockutils.synchronized(name, lock_file_prefix=None,
>> external=False, lock_path=None, semaphores=None, delay=0.01)
>>
>> or via its underlying library (that I extracted from oslo.concurrency
>> and have improved to add more usefulness) @
>> http://fasteners.readthedocs.org/
>>
>> The issue though for <<your favorite openstack project>> is that each of
>> these projects now typically has a large number of lock files that exist
>> or have existed, and no easy way to determine when those lock files can
>> be deleted (afaik no periodic task exists in said projects to clean up
>> lock files, or to delete them when they are no longer in use...), so what
>> happens is bugs like https://bugs.launchpad.net/cinder/+bug/1432387
>> appear and there is not a simple solution to clean lock files up (since
>> oslo.concurrency is really not the right layer to know when a lock can
>> or can not be deleted; only the application knows that...)
>>
>> So then we get a few creative solutions like the following:
>>
>> - https://review.openstack.org/#/c/241663/
>> - https://review.openstack.org/#/c/239678/
>> - (and others?)
>>
>> So I wanted to ask the question: how are people involved in <<your favorite openstack project>> cleaning up these files (are they at all?)
>>
>> Another idea that I have been proposing also is to use offset locks.
>>
>> This would allow for not creating X lock files, but create a *single*
>> lock file per project and use offsets into it as the way to lock. For
>> example nova could/would create a 1MB (or larger/smaller) *empty* file
>> for locks, that would allow for 1,048,576 locks to be used at the same
>> time, which honestly should be way more than enough, and then there
>> would not need to be any lock cleanup at all... Is there any reason this
>> wasn't initially done back when this lock file code was created?
>> (https://github.com/harlowja/fasteners/pull/10 adds this functionality
>> to the underlying library if people want to look it over)
>
> I think the main reason was that even with a million locks available,
> you'd have to find a way to hash the lock names to offsets in the file,
> and a million isn't a very large collision space for that.  Having two
> differently named locks that hashed to the same offset would lead to
> incredibly confusing bugs.
>
> We could switch to requiring the projects to provide the offsets instead
> of hashing a string value, but that's just pushing the collision problem
> off onto every project that uses us.
>
> So that's the problem as I understand it, but where does that leave us
> for solutions?  First, there's
> https://github.com/openstack/oslo.concurrency/blob/master/oslo_concurrency/lockutils.py#L151
> which allows consumers to delete lock files when they're done with them.
>  Of course, in that case the onus is on the caller to make sure the lock
> couldn't possibly be in use anymore.
>
> Second, is this actually a problem?  Modern filesystems have absurdly
> large limits on the number of files in a directory, so it's highly
> unlikely we would ever exhaust that, and we're creating all zero byte
> files so there shouldn't be a significant space impact either.  In the
> past I believe our recommendation has been to simply create a cleanup
> job that runs on boot, before any of the OpenStack services start, that
> deletes all of the lock files.  At that point you know it's safe to
> delete them, and it prevents your lock file directory from growing forever.

Re: [openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

2015-12-01 Thread Duncan Thomas
On 1 December 2015 at 13:40, Sean Dague <s...@dague.net> wrote:

>
> The current approach means locks block on their own, are processed in
> the order they come in, but deletes aren't possible. The busy lock would
> mean deletes were normal. Some extra cpu is spent on waiting, and lock
> order processing would be non-deterministic. It's a set of trade-offs, but I
> don't know of anywhere that we are using locks as queues, so order
> shouldn't matter. The cpu cost of the busy wait, versus lock file
> cleanliness, might be a trade worth making. It would also let you actually
> see what's locked from the outside pretty easily.
>
>
The cinder locks are very much used as queues in places, e.g. making delete
wait until after an image operation finishes. Given that cinder can already
bring a node into resource issues while doing lots of image operations
concurrently (such as creating lots of bootable volumes at once) I'd be
resistant to anything that makes it worse to solve a cosmetic issue.


-- 
Duncan Thomas


Re: [openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

2015-11-30 Thread Ben Nemec
On 11/30/2015 02:15 PM, Sean Dague wrote:
> On 11/30/2015 03:01 PM, Robert Collins wrote:
>> On 1 December 2015 at 08:37, Ben Nemec  wrote:
>>> On 11/30/2015 12:42 PM, Joshua Harlow wrote:
 Hi all,

 I just wanted to bring up an issue, possible solution and get feedback
 on it from folks because it seems to be an on-going problem that shows
 up not when an application is initially deployed but as on-going
 operation and running of that application proceeds (ie after running for
 a period of time).

 The gist of the problem is the following:

 A <<your favorite openstack project>> has a need to ensure that no
 application on the same machine can manipulate a given resource on that
 same machine, so it uses the lock file pattern (acquire a *local* lock
 file for that resource, manipulate that resource, release that lock
 file) to do actions on that resource in a safe manner (note this does
 not ensure safety outside of that machine, lock files are *not*
 distributed locks).

 The api that we expose from oslo is typically accessed via the following:

oslo_concurrency.lockutils.synchronized(name, lock_file_prefix=None,
 external=False, lock_path=None, semaphores=None, delay=0.01)

 or via its underlying library (that I extracted from oslo.concurrency
 and have improved to add more usefulness) @
 http://fasteners.readthedocs.org/

 The issue though for <<your favorite openstack project>> is that each of
 these projects now typically has a large number of lock files that exist
 or have existed, and no easy way to determine when those lock files can
 be deleted (afaik no periodic task exists in said projects to clean up
 lock files, or to delete them when they are no longer in use...), so what
 happens is bugs like https://bugs.launchpad.net/cinder/+bug/1432387
 appear and there is not a simple solution to clean lock files up (since
 oslo.concurrency is really not the right layer to know when a lock can
 or can not be deleted; only the application knows that...)

 So then we get a few creative solutions like the following:

 - https://review.openstack.org/#/c/241663/
 - https://review.openstack.org/#/c/239678/
 - (and others?)

 So I wanted to ask the question: how are people involved in <<your favorite openstack project>> cleaning up these files (are they at all?)

 Another idea that I have been proposing also is to use offset locks.

 This would allow for not creating X lock files, but create a *single*
 lock file per project and use offsets into it as the way to lock. For
 example nova could/would create a 1MB (or larger/smaller) *empty* file
 for locks, that would allow for 1,048,576 locks to be used at the same
 time, which honestly should be way more than enough, and then there
 would not need to be any lock cleanup at all... Is there any reason this
 wasn't initially done back when this lock file code was created?
 (https://github.com/harlowja/fasteners/pull/10 adds this functionality
 to the underlying library if people want to look it over)
>>>
>>> I think the main reason was that even with a million locks available,
>>> you'd have to find a way to hash the lock names to offsets in the file,
>>> and a million isn't a very large collision space for that.  Having two
>>> differently named locks that hashed to the same offset would lead to
>>> incredibly confusing bugs.
>>>
>>> We could switch to requiring the projects to provide the offsets instead
>>> of hashing a string value, but that's just pushing the collision problem
>>> off onto every project that uses us.
>>>
>>> So that's the problem as I understand it, but where does that leave us
>>> for solutions?  First, there's
>>> https://github.com/openstack/oslo.concurrency/blob/master/oslo_concurrency/lockutils.py#L151
>>> which allows consumers to delete lock files when they're done with them.
>>>  Of course, in that case the onus is on the caller to make sure the lock
>>> couldn't possibly be in use anymore.
>>>
>>> Second, is this actually a problem?  Modern filesystems have absurdly
>>> large limits on the number of files in a directory, so it's highly
>>> unlikely we would ever exhaust that, and we're creating all zero byte
>>> files so there shouldn't be a significant space impact either.  In the
>>> past I believe our recommendation has been to simply create a cleanup
>>> job that runs on boot, before any of the OpenStack services start, that
>>> deletes all of the lock files.  At that point you know it's safe to
>>> delete them, and it prevents your lock file directory from growing forever.
>>
>> Not that high - ext3 (still the default for nova ephemeral
>> partitions!) has a limit of 64k in one directory.
>>
>> That said, I don't disagree - my thinking is that we should advise
>> putting such files on a tmpfs.
> 
> So, I think the issue really is that the named 

Re: [openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

2015-11-30 Thread Joshua Harlow

Ben Nemec wrote:

On 11/30/2015 02:15 PM, Sean Dague wrote:

On 11/30/2015 03:01 PM, Robert Collins wrote:

On 1 December 2015 at 08:37, Ben Nemec  wrote:

On 11/30/2015 12:42 PM, Joshua Harlow wrote:

Hi all,

I just wanted to bring up an issue, possible solution and get feedback
on it from folks because it seems to be an on-going problem that shows
up not when an application is initially deployed but as on-going
operation and running of that application proceeds (ie after running for
a period of time).

The gist of the problem is the following:

A <<your favorite openstack project>> has a need to ensure that no
application on the same machine can manipulate a given resource on that
same machine, so it uses the lock file pattern (acquire a *local* lock
file for that resource, manipulate that resource, release that lock
file) to do actions on that resource in a safe manner (note this does
not ensure safety outside of that machine, lock files are *not*
distributed locks).

The api that we expose from oslo is typically accessed via the following:

oslo_concurrency.lockutils.synchronized(name, lock_file_prefix=None,
external=False, lock_path=None, semaphores=None, delay=0.01)

or via its underlying library (that I extracted from oslo.concurrency
and have improved to add more usefulness) @
http://fasteners.readthedocs.org/

The issue though for <<your favorite openstack project>> is that each of
these projects now typically has a large number of lock files that exist
or have existed, and no easy way to determine when those lock files can
be deleted (afaik no periodic task exists in said projects to clean up
lock files, or to delete them when they are no longer in use...), so what
happens is bugs like https://bugs.launchpad.net/cinder/+bug/1432387
appear and there is not a simple solution to clean lock files up (since
oslo.concurrency is really not the right layer to know when a lock can
or can not be deleted; only the application knows that...)

So then we get a few creative solutions like the following:

- https://review.openstack.org/#/c/241663/
- https://review.openstack.org/#/c/239678/
- (and others?)

So I wanted to ask the question: how are people involved in <<your favorite openstack project>> cleaning up these files (are they at all?)

Another idea that I have been proposing also is to use offset locks.

This would allow for not creating X lock files, but create a *single*
lock file per project and use offsets into it as the way to lock. For
example nova could/would create a 1MB (or larger/smaller) *empty* file
for locks, that would allow for 1,048,576 locks to be used at the same
time, which honestly should be way more than enough, and then there
would not need to be any lock cleanup at all... Is there any reason this
wasn't initially done back when this lock file code was created?
(https://github.com/harlowja/fasteners/pull/10 adds this functionality
to the underlying library if people want to look it over)

I think the main reason was that even with a million locks available,
you'd have to find a way to hash the lock names to offsets in the file,
and a million isn't a very large collision space for that.  Having two
differently named locks that hashed to the same offset would lead to
incredibly confusing bugs.

We could switch to requiring the projects to provide the offsets instead
of hashing a string value, but that's just pushing the collision problem
off onto every project that uses us.

So that's the problem as I understand it, but where does that leave us
for solutions?  First, there's
https://github.com/openstack/oslo.concurrency/blob/master/oslo_concurrency/lockutils.py#L151
which allows consumers to delete lock files when they're done with them.
  Of course, in that case the onus is on the caller to make sure the lock
couldn't possibly be in use anymore.

Second, is this actually a problem?  Modern filesystems have absurdly
large limits on the number of files in a directory, so it's highly
unlikely we would ever exhaust that, and we're creating all zero byte
files so there shouldn't be a significant space impact either.  In the
past I believe our recommendation has been to simply create a cleanup
job that runs on boot, before any of the OpenStack services start, that
deletes all of the lock files.  At that point you know it's safe to
delete them, and it prevents your lock file directory from growing forever.

Not that high - ext3 (still the default for nova ephemeral
partitions!) has a limit of 64k in one directory.

That said, I don't disagree - my thinking is that we should advise
putting such files on a tmpfs.

So, I think the issue really is that the named external locks were
originally thought to be handling some pretty sensitive critical
sections. Both cinder / nova have less than 20 such named locks.

Cinder uses a parametrized version for all volume operations -
https://github.com/openstack/cinder/blob/7fb767f2d652f070a20fd70d92585d61e56f3a50/cinder/volume/manager.py#L143


Nova also does something similar in image cache

[openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

2015-11-30 Thread Joshua Harlow

Hi all,

I just wanted to bring up an issue and possible solution and get feedback 
on it from folks, because it seems to be an ongoing problem: one that shows 
up not when an application is initially deployed, but as ongoing 
operation of that application proceeds (i.e. after running for 
a period of time).


The gist of the problem is the following:

A <<your favorite openstack project>> has a need to ensure that no 
application on the same machine can manipulate a given resource on that 
same machine, so it uses the lock file pattern (acquire a *local* lock 
file for that resource, manipulate that resource, release that lock 
file) to do actions on that resource in a safe manner (note this does 
not ensure safety outside of that machine, lock files are *not* 
distributed locks).


The api that we expose from oslo is typically accessed via the following:

  oslo_concurrency.lockutils.synchronized(name, lock_file_prefix=None, 
external=False, lock_path=None, semaphores=None, delay=0.01)


or via its underlying library (that I extracted from oslo.concurrency 
and have improved to add more usefulness) @ 
http://fasteners.readthedocs.org/
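
For example, the typical decorator usage looks roughly like this (a
hypothetical caller; the name and path are invented):

from oslo_concurrency import lockutils

# Serializes all callers of this function across processes on the same
# host via a file lock created under lock_path.
@lockutils.synchronized('volume-1234', external=True,
                        lock_path='/var/lib/myapp/locks')
def delete_volume():
    pass  # manipulate the resource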


The issue though for <<your favorite openstack project>> is that each of 
these projects now typically has a large number of lock files that exist 
or have existed, and no easy way to determine when those lock files can 
be deleted (afaik no periodic task exists in said projects to clean up 
lock files, or to delete them when they are no longer in use...), so what 
happens is bugs like https://bugs.launchpad.net/cinder/+bug/1432387 
appear and there is not a simple solution to clean lock files up (since 
oslo.concurrency is really not the right layer to know when a lock can 
or can not be deleted; only the application knows that...)


So then we get a few creative solutions like the following:

- https://review.openstack.org/#/c/241663/
- https://review.openstack.org/#/c/239678/
- (and others?)

So I wanted to ask the question: how are people involved in <<your favorite openstack project>> cleaning up these files (are they at all?)


Another idea that I have been proposing is to use offset locks.

This would allow for not creating X lock files, but creating a *single* 
lock file per project and using offsets into it as the way to lock. For 
example nova could/would create a 1MB (or larger/smaller) *empty* file 
for locks, which would allow for 1,048,576 locks to be used at the same 
time, which honestly should be way more than enough, and then there 
would not need to be any lock cleanup at all... Is there any reason this 
wasn't initially done back when this lock file code was created? 
(https://github.com/harlowja/fasteners/pull/10 adds this functionality 
to the underlying library if people want to look it over)
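
Mechanically, each offset would map onto a byte-range lock inside that
single file, something like this sketch (using the stdlib directly
rather than the actual pull request above; the path is invented):

import fcntl
import os

SLOTS = 1024 * 1024
fd = os.open('/var/lib/myapp/locks.bin', os.O_RDWR | os.O_CREAT)

def lock_offset(offset):
    # Exclusively lock a single byte at the given offset; each offset
    # is an independent lock, and the file itself is never deleted.
    fcntl.lockf(fd, fcntl.LOCK_EX, 1, offset)

def unlock_offset(offset):
    fcntl.lockf(fd, fcntl.LOCK_UN, 1, offset)

# A process-stable hash would be needed here (Python 3 salts hash()).
offset = hash('volume-1234') % SLOTS
lock_offset(offset)
try:
    pass  # manipulate the resource
finally:
    unlock_offset(offset)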


In general I would like to hear people's thoughts/ideas/complaints/other,

-Josh



Re: [openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

2015-11-30 Thread Robert Collins
On 1 December 2015 at 08:37, Ben Nemec  wrote:
> On 11/30/2015 12:42 PM, Joshua Harlow wrote:
>> Hi all,
>>
>> I just wanted to bring up an issue, possible solution and get feedback
>> on it from folks because it seems to be an on-going problem that shows
>> up not when an application is initially deployed but as on-going
>> operation and running of that application proceeds (ie after running for
>> a period of time).
>>
>> The gist of the problem is the following:
>>
>> A <<your favorite openstack project>> has a need to ensure that no
>> application on the same machine can manipulate a given resource on that
>> same machine, so it uses the lock file pattern (acquire a *local* lock
>> file for that resource, manipulate that resource, release that lock
>> file) to do actions on that resource in a safe manner (note this does
>> not ensure safety outside of that machine, lock files are *not*
>> distributed locks).
>>
>> The api that we expose from oslo is typically accessed via the following:
>>
>>oslo_concurrency.lockutils.synchronized(name, lock_file_prefix=None,
>> external=False, lock_path=None, semaphores=None, delay=0.01)
>>
>> or via its underlying library (that I extracted from oslo.concurrency
>> and have improved to add more usefulness) @
>> http://fasteners.readthedocs.org/
>>
>> The issue though for <<your favorite openstack project>> is that each of
>> these projects now typically has a large number of lock files that exist
>> or have existed, and no easy way to determine when those lock files can
>> be deleted (afaik no periodic task exists in said projects to clean up
>> lock files, or to delete them when they are no longer in use...), so what
>> happens is bugs like https://bugs.launchpad.net/cinder/+bug/1432387
>> appear and there is not a simple solution to clean lock files up (since
>> oslo.concurrency is really not the right layer to know when a lock can
>> or can not be deleted; only the application knows that...)
>>
>> So then we get a few creative solutions like the following:
>>
>> - https://review.openstack.org/#/c/241663/
>> - https://review.openstack.org/#/c/239678/
>> - (and others?)
>>
>> So I wanted to ask the question: how are people involved in <<your favorite openstack project>> cleaning up these files (are they at all?)
>>
>> Another idea that I have been proposing also is to use offset locks.
>>
>> This would allow for not creating X lock files, but create a *single*
>> lock file per project and use offsets into it as the way to lock. For
>> example nova could/would create a 1MB (or larger/smaller) *empty* file
>> for locks, that would allow for 1,048,576 locks to be used at the same
>> time, which honestly should be way more than enough, and then there
>> would not need to be any lock cleanup at all... Is there any reason this
>> wasn't initially done back when this lock file code was created?
>> (https://github.com/harlowja/fasteners/pull/10 adds this functionality
>> to the underlying library if people want to look it over)
>
> I think the main reason was that even with a million locks available,
> you'd have to find a way to hash the lock names to offsets in the file,
> and a million isn't a very large collision space for that.  Having two
> differently named locks that hashed to the same offset would lead to
> incredibly confusing bugs.
>
> We could switch to requiring the projects to provide the offsets instead
> of hashing a string value, but that's just pushing the collision problem
> off onto every project that uses us.
>
> So that's the problem as I understand it, but where does that leave us
> for solutions?  First, there's
> https://github.com/openstack/oslo.concurrency/blob/master/oslo_concurrency/lockutils.py#L151
> which allows consumers to delete lock files when they're done with them.
>  Of course, in that case the onus is on the caller to make sure the lock
> couldn't possibly be in use anymore.
>
> Second, is this actually a problem?  Modern filesystems have absurdly
> large limits on the number of files in a directory, so it's highly
> unlikely we would ever exhaust that, and we're creating all zero byte
> files so there shouldn't be a significant space impact either.  In the
> past I believe our recommendation has been to simply create a cleanup
> job that runs on boot, before any of the OpenStack services start, that
> deletes all of the lock files.  At that point you know it's safe to
> delete them, and it prevents your lock file directory from growing forever.

Not that high - ext3 (still the default for nova ephemeral
partitions!) has a limit of 64k in one directory.

That said, I don't disagree - my thinking is that we should advise
putting such files on a tmpfs.

-Rob



Re: [openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

2015-11-30 Thread Clint Byrum
Excerpts from Joshua Harlow's message of 2015-11-30 10:42:53 -0800:
> Hi all,
> 
> I just wanted to bring up an issue, possible solution and get feedback 
> on it from folks because it seems to be an on-going problem that shows 
> up not when an application is initially deployed but as on-going 
> operation and running of that application proceeds (ie after running for 
> a period of time).
> 
> The gist of the problem is the following:
> 
> A <<your favorite openstack project>> has a need to ensure that no 
> application on the same machine can manipulate a given resource on that 
> same machine, so it uses the lock file pattern (acquire a *local* lock 
> file for that resource, manipulate that resource, release that lock 
> file) to do actions on that resource in a safe manner (note this does 
> not ensure safety outside of that machine, lock files are *not* 
> distributed locks).
> 
> The api that we expose from oslo is typically accessed via the following:
> 
>oslo_concurrency.lockutils.synchronized(name, lock_file_prefix=None, 
> external=False, lock_path=None, semaphores=None, delay=0.01)
> 
> or via its underlying library (that I extracted from oslo.concurrency 
> and have improved to add more usefulness) @ 
> http://fasteners.readthedocs.org/
> 
> The issue though for <<your favorite openstack project>> is that each of 
> these projects now typically has a large number of lock files that exist 
> or have existed, and no easy way to determine when those lock files can 
> be deleted (afaik no periodic task exists in said projects to clean up 
> lock files, or to delete them when they are no longer in use...), so what 
> happens is bugs like https://bugs.launchpad.net/cinder/+bug/1432387 
> appear and there is not a simple solution to clean lock files up (since 
> oslo.concurrency is really not the right layer to know when a lock can 
> or can not be deleted; only the application knows that...)
> 
> So then we get a few creative solutions like the following:
> 
> - https://review.openstack.org/#/c/241663/
> - https://review.openstack.org/#/c/239678/
> - (and others?)
> 
> So I wanted to ask the question: how are people involved in <<your favorite openstack project>> cleaning up these files (are they at all?)
> 
> Another idea that I have been proposing also is to use offset locks.
> 
> This would allow for not creating X lock files, but create a *single* 
> lock file per project and use offsets into it as the way to lock. For 
> example nova could/would create a 1MB (or larger/smaller) *empty* file 
> for locks, that would allow for 1,048,576 locks to be used at the same 
> time, which honestly should be way more than enough, and then there 
> would not need to be any lock cleanup at all... Is there any reason this 
> wasn't initially done back when this lock file code was created? 
> (https://github.com/harlowja/fasteners/pull/10 adds this functionality 
> to the underlying library if people want to look it over)

This is really complicated, and basically just makes the directory of
lock files _look_ clean. But it still leaves each offset stale, and has
to be cleaned anyway.

Fasteners already has process locks that use fcntl/flock.

These locks provide enough to allow you to infer things about the owner
of the lock file. If there's no process still holding the exclusive lock
when you try to lock it, then YOU own it, and thus control the resource.

A cron job which tries to flock anything older than ${REASONABLE_TIME}
and deletes them seems fine. Whatever process was trying to interact
with the resource is gone at that point.

Now, anything that needs to safely manage a resource without a
live process will need to keep track of its own state and be idempotent
anyway. IMO this isn't something lock files alone solve well. I believe
you're familiar with a library named taskflow that is supposed to help
write code that does this better ;). Even without taskflow, if you are
trying to do something exclusive without a single process that stays
alive, you need to do _something_ to keep track of state and restart
or revert that flow. That is a state management problem, not a locking
problem.



Re: [openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

2015-11-30 Thread Ben Nemec
On 11/30/2015 12:42 PM, Joshua Harlow wrote:
> Hi all,
> 
> I just wanted to bring up an issue, possible solution and get feedback 
> on it from folks because it seems to be an on-going problem that shows 
> up not when an application is initially deployed but as on-going 
> operation and running of that application proceeds (ie after running for 
> a period of time).
> 
> The gist of the problem is the following: 
> 
> A <<your favorite openstack project>> has a need to ensure that no 
> application on the same machine can manipulate a given resource on that 
> same machine, so it uses the lock file pattern (acquire a *local* lock 
> file for that resource, manipulate that resource, release that lock 
> file) to do actions on that resource in a safe manner (note this does 
> not ensure safety outside of that machine, lock files are *not* 
> distributed locks).
> 
> The api that we expose from oslo is typically accessed via the following:
> 
>oslo_concurrency.lockutils.synchronized(name, lock_file_prefix=None, 
> external=False, lock_path=None, semaphores=None, delay=0.01)
> 
> or via its underlying library (that I extracted from oslo.concurrency 
> and have improved to add more usefulness) @ 
> http://fasteners.readthedocs.org/
> 
> The issue though for <<your favorite openstack project>> is that each of 
> these projects now typically has a large amount of lock files that exist 
> or have existed and no easy way to determine when those lock files can 
> be deleted (afaik no periodic task exists in said projects to clean up 
> lock files, or to delete them when they are no longer in use...) so what 
> happens is bugs like https://bugs.launchpad.net/cinder/+bug/1432387 
> appear and there is not a simple solution to clean lock files up (since 
> oslo.concurrency is really not the right layer to know when a lock can 
> or can not be deleted, only the application knows that...)
> 
> So then we get a few creative solutions like the following:
> 
> - https://review.openstack.org/#/c/241663/
> - https://review.openstack.org/#/c/239678/
> - (and others?)
> 
> So I wanted to ask the question, how are people involved in <<your favorite openstack project>> cleaning up these files (are they at all?)
> 
> Another idea that I have been proposing also is to use offset locks.
> 
> This would allow for not creating X lock files, but creating a *single* 
> lock file per project and using offsets into it as the way to lock. For 
> example nova could/would create a 1MB (or larger/smaller) *empty* file 
> for locks, that would allow for 1,048,576 locks to be used at the same 
> time, which honestly should be way more than enough, and then there 
> would not need to be any lock cleanup at all... Is there any reason this 
> wasn't initially done back when this lock file code was created? 
> (https://github.com/harlowja/fasteners/pull/10 adds this functionality 
> to the underlying library if people want to look it over)

I think the main reason was that even with a million locks available,
you'd have to find a way to hash the lock names to offsets in the file,
and a million isn't a very large collision space for that.  Having two
differently named locks that hashed to the same offset would lead to
incredibly confusing bugs.

We could switch to requiring the projects to provide the offsets instead
of hashing a string value, but that's just pushing the collision problem
off onto every project that uses us.
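
To make the collision concern concrete, here is a rough sketch of the
name-to-offset scheme using POSIX byte-range locks (the constants and
helper are invented; the actual proposal is in the fasteners pull request
above):

    import fcntl
    import hashlib
    import os

    NUM_SLOTS = 1024 * 1024  # one lock per byte of a 1MB file

    def acquire_offset_lock(fd, name):
        # Hash the lock name into one of NUM_SLOTS byte offsets...
        offset = int(hashlib.sha256(name.encode('utf-8')).hexdigest(),
                     16) % NUM_SLOTS
        # ...and exclusively lock the single byte at that offset.
        fcntl.lockf(fd, fcntl.LOCK_EX, 1, offset, os.SEEK_SET)
        return offset

    # fd = os.open('/var/lib/myproj/locks.img', os.O_RDWR | os.O_CREAT)
    # Two *different* names can hash to the same offset, silently
    # serializing unrelated critical sections with no visible error.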

So that's the problem as I understand it, but where does that leave us
for solutions?  First, there's
https://github.com/openstack/oslo.concurrency/blob/master/oslo_concurrency/lockutils.py#L151
which allows consumers to delete lock files when they're done with them.
 Of course, in that case the onus is on the caller to make sure the lock
couldn't possibly be in use anymore.
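
A sketch of pairing the two calls (the lock name and prefix are made up,
and this assumes remove_external_lock_file mirrors synchronized's
name/prefix arguments; the caveat above still applies):

    from oslo_concurrency import lockutils

    @lockutils.synchronized('volume-123', lock_file_prefix='myproj-',
                            external=True)
    def delete_volume():
        pass  # the exclusive work on the resource

    # Later, once the caller *knows* nothing can still want this lock:
    lockutils.remove_external_lock_file('volume-123',
                                        lock_file_prefix='myproj-')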

Second, is this actually a problem?  Modern filesystems have absurdly
large limits on the number of files in a directory, so it's highly
unlikely we would ever exhaust that, and we're creating all zero byte
files so there shouldn't be a significant space impact either.  In the
past I believe our recommendation has been to simply create a cleanup
job that runs on boot, before any of the OpenStack services start, that
deletes all of the lock files.  At that point you know it's safe to
delete them, and it prevents your lock file directory from growing forever.

I know we've had this discussion in the past, but I don't think anyone
has ever told me that having lock files hang around was a functional
problem for them.  It seems to be largely cosmetic complaints about not
cleaning up the old files (which, as you noted, Oslo can't really solve
because we have no idea when consumers are finished with locks) and
given the amount of trouble we've had with interprocess locking in the
past I've never felt that a cosmetic issue was sufficient reason to
reopen that can of worms.  I'll just note again that every time we've
started messing with this stuff we run into a bunch of sticky problems
and edge cases, so it would take a pretty 

Re: [openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

2015-11-30 Thread Joshua Harlow

Ben Nemec wrote:

On 11/30/2015 12:42 PM, Joshua Harlow wrote:

[snip: original proposal, quoted in full above]


I think the main reason was that even with a million locks available,
you'd have to find a way to hash the lock names to offsets in the file,
and a million isn't a very large collision space for that.  Having two
differently named locks that hashed to the same offset would lead to
incredibly confusing bugs.

We could switch to requiring the projects to provide the offsets instead
of hashing a string value, but that's just pushing the collision problem
off onto every project that uses us.

So that's the problem as I understand it, but where does that leave us
for solutions?  First, there's
https://github.com/openstack/oslo.concurrency/blob/master/oslo_concurrency/lockutils.py#L151
which allows consumers to delete lock files when they're done with them.
  Of course, in that case the onus is on the caller to make sure the lock
couldn't possibly be in use anymore.


Ya, I wonder how many folks are actually doing this, because the exposed 
API of @synchronized doesn't seem to tell u what file to even delete in 
the first place :-/ perhaps we should make that more accessible so that 
people/consumers of that code could know what to delete...
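
To make that complaint concrete, here is a hypothetical helper that
re-derives the file name from the same inputs @synchronized takes. It
assumes the internal prefix-plus-name layout under lock_path, which is
exactly the problem: that convention is private, so code like this can
silently point at the wrong file.

    import os

    def guess_external_lock_file(name, lock_file_prefix, lock_path):
        # Assumes oslo's internal naming of <lock_path>/<prefix><name>;
        # nothing in the public API promises this.
        prefix = lock_file_prefix or ''
        return os.path.join(lock_path, prefix + name)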




Second, is this actually a problem?  Modern filesystems have absurdly
large limits on the number of files in a directory, so it's highly
unlikely we would ever exhaust that, and we're creating all zero byte
files so there shouldn't be a significant space impact either.  In the
past I believe our recommendation has been to simply create a cleanup
job that runs on boot, before any of the OpenStack services start, that
deletes all of the lock files.  At that point you know it's safe to
delete them, and it prevents your lock file directory from growing forever.


Except as we move to never shutting an app down (always online and live 
upgrades and all that jazz), it will have to run more than just on boot, 
but point taken.




[snip: remainder of Ben's reply, quoted in full above]

Re: [openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

2015-11-30 Thread Sean Dague
On 11/30/2015 03:01 PM, Robert Collins wrote:
> On 1 December 2015 at 08:37, Ben Nemec  wrote:
>> On 11/30/2015 12:42 PM, Joshua Harlow wrote:
>>> [snip: original proposal, quoted in full above]
>>
>> [snip: Ben's reply, quoted in full above]
> 
> Not that high - ext3 (still the default for nova ephemeral
> partitions!) has a limit of 64k in one directory.
> 
> That said, I don't disagree - my thinking is that we should advise
> putting such files on a tmpfs.

So, I think the issue really is that the named external locks were
originally thought to be handling some pretty sensitive critical
sections. Both cinder / nova have less than 20 such named locks.

Re: [openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

2015-11-30 Thread Joshua Harlow

Joshua Harlow wrote:

Ben Nemec wrote:

On 11/30/2015 12:42 PM, Joshua Harlow wrote:

[snip: original proposal, quoted in full above]
So I wanted to ask the question, how are people involved in <<your favorite openstack project>> cleaning up these files (are they at all?)


From some simple greps using:

$ echo "Removal usage in" $(basename `pwd`); grep -R remove_external_lock_file *


Removal usage in cinder


Removal usage in nova
nova/virt/libvirt/imagecache.py: lockutils.remove_external_lock_file(lock_file,


Removal usage in glance


Removal usage in neutron


So me thinks people aren't cleaning any of these up :-/



[snip: remainder of the earlier exchange, quoted in full above]

Re: [openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

2015-11-30 Thread Joshua Harlow

Clint Byrum wrote:

Excerpts from Joshua Harlow's message of 2015-11-30 10:42:53 -0800:

[snip: original proposal, quoted in full above]


This is really complicated, and basically just makes the directory of
lock files _look_ clean. But it still leaves each offset stale, and has
to be cleaned anyway.


What do u mean here (out of curiosity), each offset stale? The file 
would basically never change size after startup (pick a large enough 
number, 10 million, 1 trillion billion...) and use it appropriately from 
there on out...




Fasteners already has process locks that use fcntl/flock.

These locks provide enough to allow you to infer things about the owner
of the lock file. If there's no process still holding the exclusive lock
when you try to lock it, then YOU own it, and thus control the resource.


Well not really, python doesn't expose the ability to introspect who has 
the handle afaik. I tried to look into that and it looks like fcntl 
(the C api) might have a way to get it, but u can't really introspect 
that without, as u stated, acquiring the lock yourself... I can try to 
recall more of this investigation when I was trying to add a @owner_pid 
property onto fasteners' interprocess lock class but from my simple 
memory the exposed API isn't there in python.
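
For reference, the C-level route is fcntl(F_GETLK), and it is reachable
from python, but only by hand-packing a platform-specific struct flock,
which is presumably why a clean @owner_pid never landed. A fragile sketch,
assuming the common Linux x86-64 struct layout:

    import fcntl
    import struct

    def lock_holder_pid(fd):
        # struct flock on Linux x86-64: short l_type; short l_whence;
        # off_t l_start; off_t l_len; pid_t l_pid (native format 'hhqqi').
        probe = struct.pack('hhqqi', fcntl.F_WRLCK, 0, 0, 0, 0)
        result = fcntl.fcntl(fd, fcntl.F_GETLK, probe)
        l_type, _, _, _, pid = struct.unpack('hhqqi', result)
        if l_type == fcntl.F_UNLCK:
            return None  # nobody holds a conflicting lock
        return pid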




A cron job which tries to flock anything older than ${REASONABLE_TIME}
and deletes them seems fine. Whatever process was trying to interact
with the resource is gone at that point.


Yes, or a periodic thread in the application that can do this in a safe 
manner (using its ability to know exactly what its own app's internals 
are doing...)




Now, anything that needs to safely manage a resource without a
live process will need to keep track of its own state and be idempotent
anyway. IMO this isn't something lock files alone solve well. I believe
you're familiar with a library named taskflow that is supposed to help
write code that does this better ;). Even without taskflow, if you are
trying to do something exclusive without a single process that stays
alive, you need to do _something_ to keep track of state and restart
or revert that flow. That is a state management problem, not a locking
problem.



Agreed. ;)



Re: [openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

2015-11-30 Thread Joshua Harlow

Sean Dague wrote:

On 11/30/2015 03:01 PM, Robert Collins wrote:

On 1 December 2015 at 08:37, Ben Nemec  wrote:

[snip: Joshua's proposal, Ben's reply, and Robert's ext3/tmpfs note, quoted in full above]


So, I think the issue really is that the named external locks were
originally thought to be handling some pretty sensitive critical
sections. Both cinder / nova have less than 20 such named locks.

Cinder uses a parametrized version for all volume operations -
https://github.com/openstack/cinder/blob/7fb767f2d652f070a20fd70d92585d61e56f3a50/cinder/volume/manager.py#L143


Nova also does something similar in its image cache.
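
The parametrized pattern boils down to templating the resource id into the
lock name, roughly like this (a sketch; the name template is invented, not
copied from cinder):

    from oslo_concurrency import lockutils

    def volume_operation_lock(volume_id):
        # One named external lock per volume: operations on the same
        # volume serialize, while unrelated volumes proceed in parallel.
        return lockutils.lock('volume-%s' % volume_id, external=True)

    # with volume_operation_lock(vol.id):
    #     ...delete/extend/snapshot the volume...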

Re: [openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

2015-11-30 Thread Ben Nemec
On 11/30/2015 01:57 PM, Joshua Harlow wrote:
> Ben Nemec wrote:
>> On 11/30/2015 12:42 PM, Joshua Harlow wrote:
>>> [snip: original proposal, quoted in full above]
>>
>> [snip: Ben's reply, quoted in full above]
> 
> Ya, I wonder how many folks are actually doing this, because the exposed 
> API of @synchronized doesn't seem to tell u what file to even delete in 
> the first place :-/ perhaps we should make that more accessible so that 
> people/consumers of that code could know what to delete...

I'm not opposed to allowing users to clean up lock files, although I
think the docstrings for the methods should be very clear that it isn't
strictly necessary and it must be done carefully to avoid deleting
in-use files (the existing docstring is actually insufficient IMHO, but
I'm pretty sure I reviewed it when it went in so I have no one else to
blame ;-).

> 
>> [snip: remainder of Ben's earlier reply, quoted in full above]

Re: [openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

2015-11-30 Thread Clint Byrum
Excerpts from Ben Nemec's message of 2015-11-30 13:22:23 -0800:
> On 11/30/2015 02:15 PM, Sean Dague wrote:
> > On 11/30/2015 03:01 PM, Robert Collins wrote:
> >> On 1 December 2015 at 08:37, Ben Nemec  wrote:
> >>> On 11/30/2015 12:42 PM, Joshua Harlow wrote:
>  [snip: original proposal, quoted in full above]
> >>>
> >>> [snip: Ben's reply, quoted in full above]
> >>
> >> Not that high - ext3