Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-30 Thread Paul Walmsley
On Wed, 13 Mar 2013, Jeff Layton wrote:

> Of course, this is all a lot of work, and not something we can shove
> into the kernel for 3.9 at this point. In the meantime, while Mandeep's
> warning is correctly pointing out a problem, I think we ought to back
> it out until we can fix this properly.

Revert posted at:

http://marc.info/?l=linux-kernel&m=136468829626184&w=2

Andrew or Linus, care to pick this up for 3.9-rc?


- Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-13 Thread Jeff Layton
On Wed, 6 Mar 2013 13:40:16 -0800
Tejun Heo  wrote:

> On Wed, Mar 06, 2013 at 01:36:36PM -0800, Tejun Heo wrote:
> > On Wed, Mar 06, 2013 at 01:31:10PM -0800, Linus Torvalds wrote:
> > > So I do agree that we probably have *too* many of the stupid "let's
> > > check if we can freeze", and I suspect that the NFS code should get
> > > rid of the "freezable_schedule()" that is causing this warning
> > > (because I also agree that you should *not* freeze while holding
> > > locks, because it really can cause deadlocks), but I do suspect that
> > > network filesystems do need to have a few places where they check for
> > > freezing on their own... Exactly because freezing isn't *quite* like a
> > > signal.
> > 
> > Well, I don't really know much about nfs so I can't really tell, but
> > for most other cases, dealing with freezing like a signal should work
> > fine from what I've seen although I can't be sure before actually
> > trying.  Trond, Bruce, can you guys please chime in?
> 
> So, I think the question here would be, in nfs, how many of the
> current freezer check points would be difficult to convert to the signal
> handling model after excluding the ones which are performed while
> holding some locks which we need to get rid of anyway?
> 

I think we can do this, but it isn't trivial. Here's what I'd envision,
but there are still many details that would need to be worked out...

Basically what we'd need is a way to go into TASK_KILLABLE sleep, but
still allow the freezer to wake these processes up. I guess that likely
means we need a new sleep state (TASK_FREEZABLE?).

We'd also need a way to return an -ERESTARTSYS like error (-EFREEZING?)
that tells the upper syscall handling layers to freeze the task and
then restart the syscall after waking back up. Maybe we don't need a
new error at all and -ERESTARTSYS is fine here? We also need to
consider the effects vs. audit code here, but that may be OK with the
overhaul that Al merged a few months ago.

Assuming we have those, then we need to fix up the NFS and RPC layers
to use this stuff:

1/ Prior to submitting a new RPC, we'd look and see whether "current"
is being frozen. If it is, then return -EFREEZING immediately without
doing anything.

2/ We might also need to retrofit certain stages in the RPC FSM to
return -EFREEZING too if it's a synchronous RPC and the task is being
frozen.

3/ When a task is waiting for an RPC reply on an async RPC, we'd need to use
this new sleep state. If the process wakes up because something wants
it to freeze, then have it go back to sleep for a short period of time
to try and wait for the reply (up to 15s or so?). If we get the reply,
great -- return to userland and freeze the task there. If the reply
never comes in, give up on it and just return -EFREEZING and hope for the
best.

We might have to make this latter behavior contingent on a new mount
option (maybe "-o slushy" like Trond recommended). The current "hard"
and "soft" semantics don't really fit this situation correctly.

Of course, this is all a lot of work, and not something we can shove
into the kernel for 3.9 at this point. In the meantime, while Mandeep's
warning is correctly pointing out a problem, I think we ought to back
it out until we can fix this properly.

We're already getting a ton of reports on the mailing lists and in the
fedora bug tracker for this warning. Part of the problem is the
verbiage -- "BUG" makes people think "Oops", but this is really just a
warning.

We should also note that this is a problem too in the CIFS code since
it uses a similar mechanism for allowing the kernel to suspend while
waiting on SMB replies.

-- 
Jeff Layton 


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-08 Thread Ingo Molnar

* Linus Torvalds  wrote:

>  - the "freeezer for suspend/resume on a laptop"
> 
> [...]
> 
> The second one is unlikely to really use NFS anyway. [...]

<me raises a hand>

Incidentally, I use NFS to a file server on my laptop, over wifi, and I close
the lid for the night. It's NFS mounted soft.

If it was mounted hard, I wouldn't expect the lid close event to result in a
successful suspend if wifi went down - but with soft, I would be pretty sad in
the morning to find the battery drained because I closed the lid after a power
outage took down both wifi and the laptop power supply.

Closing the lid while on battery is a pretty strong command from the user.

Arguably this is a special case on several levels, as on desktop distros it's
not at all easy to mount NFS shares (all point & click automation goes to
Samba shares); just wanted to mention it.

Thanks,

Ingo


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-07 Thread Jeff Layton
On Thu, 7 Mar 2013 17:16:12 +
"Myklebust, Trond"  wrote:

> On Thu, 2013-03-07 at 09:03 -0800, Linus Torvalds wrote:
> > On Thu, Mar 7, 2013 at 8:45 AM, Myklebust, Trond
> >  wrote:
> > >
> > > The problem there is that we get into the whole 'hard' vs 'soft' mount
> > > problem. We're supposed to guarantee data integrity for 'hard' mounts,
> > > so no funny business is allowed. OTOH, 'soft' mounts time out and return
> > > EIO to the application anyway, and so shouldn't be a problem.
> > >
> > > Perhaps we could add a '-oslushy' mount option :-) that guarantees data
> > > integrity for all situations _except_ ENETDOWN/ENETUNREACH?
> > 
> > I do think we are probably over-analyzing this. It's not like people
> > who want freezing to work usually use flaky NFS. There's really two
> > main groups:
> > 
> >  - the "freezer as a snapshot mechanism" that might use NFS because
> > they are in a server environment.
> > 
> >  - the "freeezer for suspend/resume on a laptop"
> > 
> > The first one does use NFS, and cares about it, and probably would
> > prefer the freeze event to take longer and finish for all ongoing IO
> > operations. End result: just ditch the "freezable_schedule()"
> > entirely.
> > 
> > The second one is unlikely to really use NFS anyway. End result:
> > ditching the freezable_schedule() is probably perfectly fine, even if
> > it would cause suspend failures if the network is being troublesome.
> > 
> > So for now, why not just replace freezable_schedule() with plain
> > schedule() in the NFS code, and ignore it until somebody actually
> > complains about it, and then aim to try to do something more targeted
> > for that particular complaint?
> 
> We _have_ had complaints about the laptop suspension problem; that was
> why Jeff introduced freezable_schedule() in the first place. We've never
> had complaints about any problems involving cgroup_freeze. This is why
> our focus tends to be on the former, and why I'm more worried about
> laptop suspend regressions for any short term fixes.
> 

Right. I don't know of anyone using the cgroup_freeze stuff. OTOH, I've
seen *many* reports of people who complained because their machine
didn't suspend because of a busy NFS mount.

In fact, the problem is still not fully "solved" with the
freezable_schedule stuff we have so far anyway. Even after this
(admittedly crappy) solution went in, we still had reports of machines
that failed to suspend because other tasks were stuck in
wait_on_page_writeback() for NFS pages.

So, in principle I think we need to abandon the current approach. The
question is what should replace it.

-- 
Jeff Layton 


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-07 Thread Rafael J. Wysocki
On Thursday, March 07, 2013 08:25:10 AM Linus Torvalds wrote:
> On Thu, Mar 7, 2013 at 7:59 AM, Myklebust, Trond
>  wrote:
> >
> > It _shouldn't_ be an interruption unless the filesystem can't make
> > progress.
> 
> So how can we tell? Calling "freezable_schedule()" if you're not ready
> to be frozen is not good. And nobody but the NFS code can know.
> 
> You might want to introduce some counter that counts number of
> outstanding non-interruptible events, and only call the "freezable"
> version if that counter is zero.
> 
> A better alternative might be to *never* call the freezable version.
> Because those freezable_*() things are really quite disgusting, and
> are wrong - they don't actually freeze the process, they say "I don't
> care if you freeze me while I sleep", and you might actually wake up
> *while* the system is being frozen. I think the whole concept is
> broken. Rafael - comments?

Well, the only comment I have is that the source of these functions was
commit d310310c (Freezer / sunrpc / NFS: don't allow TASK_KILLABLE sleeps to
block the freezer) and they were added because people were complaining that
they couldn't suspend while their NFS servers were not responding, IIRC.

I agree that they are not really nice.

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-07 Thread Myklebust, Trond
On Thu, 2013-03-07 at 09:03 -0800, Linus Torvalds wrote:
> On Thu, Mar 7, 2013 at 8:45 AM, Myklebust, Trond
>  wrote:
> >
> > The problem there is that we get into the whole 'hard' vs 'soft' mount
> > problem. We're supposed to guarantee data integrity for 'hard' mounts,
> > so no funny business is allowed. OTOH, 'soft' mounts time out and return
> > EIO to the application anyway, and so shouldn't be a problem.
> >
> > Perhaps we could add a '-oslushy' mount option :-) that guarantees data
> > integrity for all situations _except_ ENETDOWN/ENETUNREACH?
> 
> I do think we are probably over-analyzing this. It's not like people
> who want freezing to work usually use flaky NFS. There's really two
> main groups:
> 
>  - the "freezer as a snapshot mechanism" that might use NFS because
> they are in a server environment.
> 
>  - the "freeezer for suspend/resume on a laptop"
> 
> The first one does use NFS, and cares about it, and probably would
> prefer the freeze event to take longer and finish for all ongoing IO
> operations. End result: just ditch the "freezable_schedule()"
> entirely.
> 
> The second one is unlikely to really use NFS anyway. End result:
> ditching the freezable_schedule() is probably perfectly fine, even if
> it would cause suspend failures if the network is being troublesome.
> 
> So for now, why not just replace freezable_schedule() with plain
> schedule() in the NFS code, and ignore it until somebody actually
> complains about it, and then aim to try to do something more targeted
> for that particular complaint?

We _have_ had complaints about the laptop suspension problem; that was
why Jeff introduced freezable_schedule() in the first place. We've never
had complaints about any problems involving cgroup_freeze. This is why
our focus tends to be on the former, and why I'm more worried about
laptop suspend regressions for any short term fixes.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-07 Thread Linus Torvalds
On Thu, Mar 7, 2013 at 8:45 AM, Myklebust, Trond
 wrote:
>
> The problem there is that we get into the whole 'hard' vs 'soft' mount
> problem. We're supposed to guarantee data integrity for 'hard' mounts,
> so no funny business is allowed. OTOH, 'soft' mounts time out and return
> EIO to the application anyway, and so shouldn't be a problem.
>
> Perhaps we could add a '-oslushy' mount option :-) that guarantees data
> integrity for all situations _except_ ENETDOWN/ENETUNREACH?

I do think we are probably over-analyzing this. It's not like people
who want freezing to work usually use flaky NFS. There's really two
main groups:

 - the "freezer as a snapshot mechanism" that might use NFS because
they are in a server environment.

 - the "freeezer for suspend/resume on a laptop"

The first one does use NFS, and cares about it, and probably would
prefer the freeze event to take longer and finish for all ongoing IO
operations. End result: just ditch the "freezable_schedule()"
entirely.

The second one is unlikely to really use NFS anyway. End result:
ditching the freezable_schedule() is probably perfectly fine, even if
it would cause suspend failures if the network is being troublesome.

So for now, why not just replace freezable_schedule() with plain
schedule() in the NFS code, and ignore it until somebody actually
complains about it, and then aim to try to do something more targeted
for that particular complaint?

 Linus


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-07 Thread Myklebust, Trond
On Thu, 2013-03-07 at 08:25 -0800, Linus Torvalds wrote:
> On Thu, Mar 7, 2013 at 7:59 AM, Myklebust, Trond
>  wrote:
> >
> > It _shouldn't_ be an interruption unless the filesystem can't make
> > progress.
> 
> So how can we tell? Calling "freezable_schedule()" if you're not ready
> to be frozen is not good. And nobody but the NFS code can know.
> 
> You might want to introduce some counter that counts number of
> outstanding non-interruptible events, and only call the "freezable"
> version if that counter is zero.
> 
> A better alternative might be to *never* call the freezable version.
> Because those freezable_*() things are really quite disgusting, and
> are wrong - they don't actually freeze the process, they say "I don't
> care if you freeze me while I sleep", and you might actually wake up
> *while* the system is being frozen. I think the whole concept is
> broken. Rafael - comments? The function really is crap, regardless of
> any unrelated NFS problems.
> 
> So what NFS could do instead is actually check the "do I need to
> freeze" flag, and if you need to freeze you consider it an abort - and
> do *not* try to continue. Just freeze, and then act as if the machine
> got rebooted as far as NFS was concerned. That should work anyway, no?
> 
> That does sound a lot more complex, though.

The problem there is that we get into the whole 'hard' vs 'soft' mount
problem. We're supposed to guarantee data integrity for 'hard' mounts,
so no funny business is allowed. OTOH, 'soft' mounts time out and return
EIO to the application anyway, and so shouldn't be a problem.

Perhaps we could add a '-oslushy' mount option :-) that guarantees data
integrity for all situations _except_ ENETDOWN/ENETUNREACH?

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-07 Thread Linus Torvalds
On Thu, Mar 7, 2013 at 7:59 AM, Myklebust, Trond
 wrote:
>
> It _shouldn't_ be an interruption unless the filesystem can't make
> progress.

So how can we tell? Calling "freezable_schedule()" if you're not ready
to be frozen is not good. And nobody but the NFS code can know.

You might want to introduce some counter that counts number of
outstanding non-interruptible events, and only call the "freezable"
version if that counter is zero.

A better alternative might be to *never* call the freezable version.
Because those freezable_*() things are really quite disgusting, and
are wrong - they don't actually freeze the process, they say "I don't
care if you freeze me while I sleep", and you might actually wake up
*while* the system is being frozen. I think the whole concept is
broken. Rafael - comments? The function really is crap, regardless of
any unrelated NFS problems.

So what NFS could do instead is actually check the "do I need to
freeze" flag, and if you need to freeze you consider it an abort - and
do *not* try to continue. Just freeze, and then act as if the machine
got rebooted as far as NFS was concerned. That should work anyway, no?

That does sound a lot more complex, though.

 Linus


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-07 Thread Tejun Heo
Hello, Linus.

On Thu, Mar 07, 2013 at 07:55:39AM -0800, Linus Torvalds wrote:
> In other words, that suggestion not only introduces new problems (a),
> it's fundamentally broken anyway (b) *AND* it doesn't even solve
> anything, it just moves it around.

I don't think it's gonna solve the problems we're talking about, but
the "moving around" part could be useful nonetheless - e.g. we don't
want to shut down the network before NFS while suspending.

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-07 Thread Myklebust, Trond
On Thu, 2013-03-07 at 07:55 -0800, Linus Torvalds wrote:
> On Thu, Mar 7, 2013 at 3:41 AM, Jeff Layton  wrote:
> >
> > I think Trond may be on the right track. We probably need some
> > mechanism to quiesce the filesystem ahead of any sort of freezer
> > event.
> 
> No, guys. That cannot work. It's a completely moronic idea. Let me
> count the ways:
> 
>  (a) it's just another form of saying "lock". But since other things
> are (by definition) going on when it happens, it will just cause
> deadlocks.
> 
>  (b) the freeze event might not even be system-global. So *some*
> processes (a cgroup) might freeze, others would not. You can't shut
> off the filesystem just because some processes might freeze.

That's the whole bloody problem in a nutshell.

We only want to freeze the filesystem when the network goes down, and in
that case we want time to clean up first. That's why it needs to be
initiated by something like NetworkManager _before_ the network is shut
down.

>  (c) it just moves the same issue somewhere else. If you have some
> operation that must be done under the lock, then such an operation
> must be completed before you've quiesced the filesystem, which is your
> whole point of that "quiesce" event. BUT THAT'S THE EXACT SAME ISSUE
> AS NOT ALLOWING THE FREEZE TO HAPPEN DURING THAT TIME.
> 
> In other words, that suggestion not only introduces new problems (a),
> it's fundamentally broken anyway (b) *AND* it doesn't even solve
> anything, it just moves it around.
> 
> The solution is damn simple: if you're in some kind of "atomic
> region", then you cannot freeze. Seriously. SO DON'T CALL
> "freezable_schedule()", FOR CHRISSAKE! You clearly aren't freezable!
> 
> Which is exactly what the new lockdep warning was all about. Don't try
> to move the problem around, when it's quite clear where the problem
> is. If you need to do something uninterruptible, you do not say "oh,
> I'm freezable". Because freezing is by definition an interruption.
> Seriously, it's that simple.

It _shouldn't_ be an interruption unless the filesystem can't make
progress.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-07 Thread Linus Torvalds
On Thu, Mar 7, 2013 at 3:41 AM, Jeff Layton  wrote:
>
> I think Trond may be on the right track. We probably need some
> mechanism to quiesce the filesystem ahead of any sort of freezer
> event.

No, guys. That cannot work. It's a completely moronic idea. Let me
count the ways:

 (a) it's just another form of saying "lock". But since other things
are (by definition) going on when it happens, it will just cause
deadlocks.

 (b) the freeze event might not even be system-global. So *some*
processes (a cgroup) might freeze, others would not. You can't shut
off the filesystem just because some processes might freeze.

 (c) it just moves the same issue somewhere else. If you have some
operation that must be done under the lock, then such an operation
must be completed before you've quiesced the filesystem, which is your
whole point of that "quiesce" event. BUT THAT'S THE EXACT SAME ISSUE
AS NOT ALLOWING THE FREEZE TO HAPPEN DURING THAT TIME.

In other words, that suggestion not only introduces new problems (a),
it's fundamentally broken anyway (b) *AND* it doesn't even solve
anything, it just moves it around.

The solution is damn simple: if you're in some kind of "atomic
region", then you cannot freeze. Seriously. SO DON'T CALL
"freezable_schedule()", FOR CHRISSAKE! You clearly aren't freezable!

Which is exactly what the new lockdep warning was all about. Don't try
to move the problem around, when it's quite clear where the problem
is. If you need to do something uninterruptible, you do not say "oh,
I'm freezable". Because freezing is by definition an interruption.
Seriously, it's that simple.

Linus


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-07 Thread Tejun Heo
Hello, Jeff.

On Thu, Mar 07, 2013 at 06:41:40AM -0500, Jeff Layton wrote:
> Suppose I call unlink("somefile"); on an NFS mount. We take all of the
> VFS locks, go down into the NFS layer. That marshals up the UNLINK
> call, sends it off to the server, and waits for the reply. While we're
> waiting, a freeze event comes in and we start returning from the
> kernel with our new -EFREEZE return code that works sort of like
> -ERESTARTSYS. Meanwhile, the server is processing the UNLINK call and
> removes the file. A little while later we wake up the machine and it
> goes to try and pick up where it left off.

But you can't freeze regardless of the freezing mechanism in such
cases, right?  The current code which allows freezing while such
operations are in progress is broken as it can lead to freezer
deadlocks.  They should go away no matter how we implement freezer, so
the question is not whether we can move all the existing freezing
points to signal mechanism but that, after removing the deadlock-prone
ones, how many would be difficult to convert.  I'm fully speculating
but my suspicion is not too many if you remove (or update somehow) the
ones which are being done with some locks held.

> The catch here is that it's quite possible that when we need to quiesce
> that we've lost communications with the server. We don't want to hold
> up the freezer at that point so the wait for replies has to be bounded
> in time somehow. If that times out, we probably just have to return all
> calls with our new -EFREEZE return and hope for the best when the
> machine wakes back up.

Sure, a separate prep step may be helpful but assuming a user
nfs-mounting stuff on a laptop, I'm not sure how reliable that can be
made.  People move around with laptops, wifi can be iffy and the lid
can be shut at any moment.  I don't think it's possible for nfs to be
laptop friendly while staying completely correct.  Designing such a
network filesystem probably is possible with transactions and whatnot
but AFAIU nfs isn't designed that way.  If such a use case is something
nfs wants to support, I think it just should make do with some
middleground - ie. just implement a mount switch which says "retry
operations across network / power problems" and explain the
implications.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-07 Thread Jeff Layton
On Wed, 6 Mar 2013 13:36:36 -0800
Tejun Heo  wrote:

> On Wed, Mar 06, 2013 at 01:31:10PM -0800, Linus Torvalds wrote:
> > So I do agree that we probably have *too* many of the stupid "let's
> > check if we can freeze", and I suspect that the NFS code should get
> > rid of the "freezable_schedule()" that is causing this warning
> > (because I also agree that you should *not* freeze while holding
> > locks, because it really can cause deadlocks), but I do suspect that
> > network filesystems do need to have a few places where they check for
> > freezing on their own... Exactly because freezing isn't *quite* like a
> > signal.
> 
> Well, I don't really know much about nfs so I can't really tell, but
> for most other cases, dealing with freezing like a signal should work
> fine from what I've seen although I can't be sure before actually
> trying.  Trond, Bruce, can you guys please chime in?
> 
> Thanks.
> 

(hopefully this isn't tl;dr)

It's not quite that simple...

The problem (as Trond already mentioned) is non-idempotent operations.
You can't just restart certain operations from scratch once you reach a
certain point. Here's an example:

Suppose I call unlink("somefile"); on an NFS mount. We take all of the
VFS locks, go down into the NFS layer. That marshals up the UNLINK
call, sends it off to the server, and waits for the reply. While we're
waiting, a freeze event comes in and we start returning from the
kernel with our new -EFREEZE return code that works sort of like
-ERESTARTSYS. Meanwhile, the server is processing the UNLINK call and
removes the file. A little while later we wake up the machine and it
goes to try and pick up where it left off.

What do we do now?

Suppose we pretend we never sent the call in the first place, marshal
up a new RPC and send it again. This is problematic -- the server will
probably send back the equivalent of ENOENT. How do we know whether the
file never existed in the first place, or whether the server processed
the original call and removed the file then?

Do we instead try and keep track of whether the RPC has been sent and
just wait for the reply on the original call? That's tricky too -- it
means adding an extra codepath to check for these sorts of restarts in
a bunch of different ops vectors into the filesystem. We also have to
somehow keep track of this state too (I guess by hanging something off
the task_struct).

Note too that the above is the simple case. We're dropping the parent's
i_mutex during the freeze. Suppose when we restart the call that the
parent directory has changed in such a way that the original lookup we
did to do the original RPC is no longer valid?

I think Trond may be on the right track. We probably need some
mechanism to quiesce the filesystem ahead of any sort of freezer
event. That quiesce could simply wait on any in flight RPCs to come
back, and not allow any new ones to go out. On syscalls where the RPC
didn't go out, we'd just return -EFREEZE or whatever and let the upper
layers restart the call after waking back up. Writeback would be
tricky, but that can be handled too.

The catch here is that it's quite possible that when we need to quiesce
that we've lost communications with the server. We don't want to hold
up the freezer at that point so the wait for replies has to be bounded
in time somehow. If that times out, we probably just have to return all
calls with our new -EFREEZE return and hope for the best when the
machine wakes back up.

-- 
Jeff Layton 






Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-07 Thread Linus Torvalds
On Thu, Mar 7, 2013 at 3:41 AM, Jeff Layton <jlay...@redhat.com> wrote:
>
> I think Trond may be on the right track. We probably need some
> mechanism to quiesce the filesystem ahead of any sort of freezer
> event.

No, guys. That cannot work. It's a completely moronic idea. Let me
count the ways:

 (a) it's just another form of saying "lock". But since other things
are (by definition) going on when it happens, it will just cause
deadlocks.

 (b) the freeze event might not even be system-global. So *some*
processes (a cgroup) might freeze, others would not. You can't shut
off the filesystem just because some processes might freeze.

 (c) it just moves the same issue somewhere else. If you have some
operation that must be done under the lock, then such an operation
must be completed before you've quiesced the filesystem, which is your
whole point of that "quiesce" event. BUT THAT'S THE EXACT SAME ISSUE
AS NOT ALLOWING THE FREEZE TO HAPPEN DURING THAT TIME.

In other words, that suggestion not only introduces new problems (a),
it's fundamentally broken anyway (b) *AND* it doesn't even solve
anything, it just moves it around.

The solution is damn simple: if you're in some kind of "atomic"
region, then you cannot freeze. Seriously. SO DON'T CALL
freezable_schedule(), FOR CHRISSAKE! You clearly aren't freezable!

Which is exactly what the new lockdep warning was all about. Don't try
to move the problem around, when it's quite clear where the problem
is. If you need to do something uninterruptible, you do not say "oh,
I'm freezable". Because freezing is by definition an interruption.
Seriously, it's that simple.

Linus


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-07 Thread Myklebust, Trond
On Thu, 2013-03-07 at 07:55 -0800, Linus Torvalds wrote:
> On Thu, Mar 7, 2013 at 3:41 AM, Jeff Layton <jlay...@redhat.com> wrote:
> >
> > I think Trond may be on the right track. We probably need some
> > mechanism to quiesce the filesystem ahead of any sort of freezer
> > event.
>
> No, guys. That cannot work. It's a completely moronic idea. Let me
> count the ways:
>
>  (a) it's just another form of saying "lock". But since other things
> are (by definition) going on when it happens, it will just cause
> deadlocks.
>
>  (b) the freeze event might not even be system-global. So *some*
> processes (a cgroup) might freeze, others would not. You can't shut
> off the filesystem just because some processes might freeze.

That's the whole bloody problem in a nutshell.

We only want to freeze the filesystem when the network goes down, and in
that case we want time to clean up first. That's why it needs to be
initiated by something like NetworkManager _before_ the network is shut
down.

>  (c) it just moves the same issue somewhere else. If you have some
> operation that must be done under the lock, then such an operation
> must be completed before you've quiesced the filesystem, which is your
> whole point of that "quiesce" event. BUT THAT'S THE EXACT SAME ISSUE
> AS NOT ALLOWING THE FREEZE TO HAPPEN DURING THAT TIME.
>
> In other words, that suggestion not only introduces new problems (a),
> it's fundamentally broken anyway (b) *AND* it doesn't even solve
> anything, it just moves it around.
>
> The solution is damn simple: if you're in some kind of "atomic"
> region, then you cannot freeze. Seriously. SO DON'T CALL
> freezable_schedule(), FOR CHRISSAKE! You clearly aren't freezable!
>
> Which is exactly what the new lockdep warning was all about. Don't try
> to move the problem around, when it's quite clear where the problem
> is. If you need to do something uninterruptible, you do not say "oh,
> I'm freezable". Because freezing is by definition an interruption.
> Seriously, it's that simple.

It _shouldn't_ be an interruption unless the filesystem can't make
progress.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-07 Thread Tejun Heo
Hello, Linus.

On Thu, Mar 07, 2013 at 07:55:39AM -0800, Linus Torvalds wrote:
> In other words, that suggestion not only introduces new problems (a),
> it's fundamentally broken anyway (b) *AND* it doesn't even solve
> anything, it just moves it around.

I don't think it's gonna solve the problems we're talking about but
the "moving around" part could be useful nonetheless - e.g. we don't
want to shut down network before nfs while suspending.

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-07 Thread Linus Torvalds
On Thu, Mar 7, 2013 at 7:59 AM, Myklebust, Trond
<trond.mykleb...@netapp.com> wrote:
>
> It _shouldn't_ be an interruption unless the filesystem can't make
> progress.

So how can we tell? Calling freezable_schedule() if you're not ready
to be frozen is not good. And nobody but the NFS code can know.

You might want to introduce some counter that counts number of
outstanding non-interruptible events, and only call the freezable
version if that counter is zero.

A better alternative might be to *never* call the freezable version.
Because those freezable_*() things are really quite disgusting, and
are wrong - they don't actually freeze the process, they say "I don't
care if you freeze me while I sleep", and you might actually wake up
*while* the system is being frozen. I think the whole concept is
broken. Rafael - comments? The function really is crap, regardless of
any unrelated NFS problems.

So what NFS could do instead is actually check the "do I need to
freeze" flag, and if you need to freeze you consider it an abort - and
do *not* try to continue. Just freeze, and then act as if the machine
got rebooted as far as NFS was concerned. That should work anyway, no?

That does sound a lot more complex, though.

 Linus


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-07 Thread Myklebust, Trond
On Thu, 2013-03-07 at 08:25 -0800, Linus Torvalds wrote:
> On Thu, Mar 7, 2013 at 7:59 AM, Myklebust, Trond
> <trond.mykleb...@netapp.com> wrote:
> >
> > It _shouldn't_ be an interruption unless the filesystem can't make
> > progress.
>
> So how can we tell? Calling freezable_schedule() if you're not ready
> to be frozen is not good. And nobody but the NFS code can know.
>
> You might want to introduce some counter that counts number of
> outstanding non-interruptible events, and only call the freezable
> version if that counter is zero.
>
> A better alternative might be to *never* call the freezable version.
> Because those freezable_*() things are really quite disgusting, and
> are wrong - they don't actually freeze the process, they say "I don't
> care if you freeze me while I sleep", and you might actually wake up
> *while* the system is being frozen. I think the whole concept is
> broken. Rafael - comments? The function really is crap, regardless of
> any unrelated NFS problems.
>
> So what NFS could do instead is actually check the "do I need to
> freeze" flag, and if you need to freeze you consider it an abort - and
> do *not* try to continue. Just freeze, and then act as if the machine
> got rebooted as far as NFS was concerned. That should work anyway, no?
>
> That does sound a lot more complex, though.

The problem there is that we get into the whole 'hard' vs 'soft' mount
problem. We're supposed to guarantee data integrity for 'hard' mounts,
so no funny business is allowed. OTOH, 'soft' mounts time out and return
EIO to the application anyway, and so shouldn't be a problem.

Perhaps we could add a '-oslushy' mount option :-) that guarantees data
integrity for all situations _except_ ENETDOWN/ENETUNREACH?

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-07 Thread Linus Torvalds
On Thu, Mar 7, 2013 at 8:45 AM, Myklebust, Trond
<trond.mykleb...@netapp.com> wrote:
>
> The problem there is that we get into the whole 'hard' vs 'soft' mount
> problem. We're supposed to guarantee data integrity for 'hard' mounts,
> so no funny business is allowed. OTOH, 'soft' mounts time out and return
> EIO to the application anyway, and so shouldn't be a problem.
>
> Perhaps we could add a '-oslushy' mount option :-) that guarantees data
> integrity for all situations _except_ ENETDOWN/ENETUNREACH?

I do think we are probably over-analyzing this. It's not like people
who want freezing to work usually use flaky NFS. There's really two
main groups:

 - the freezer as a snapshot mechanism that might use NFS because
they are in a server environment.

 - the freezer for suspend/resume on a laptop

The first one does use NFS, and cares about it, and probably would
prefer the freeze event to take longer and finish for all ongoing IO
operations. End result: just ditch the freezable_schedule()
entirely.

The second one is unlikely to really use NFS anyway. End result:
ditching the freezable_schedule() is probably perfectly fine, even if
it would cause suspend failures if the network is being troublesome.

So for now, why not just replace freezable_schedule() with plain
schedule() in the NFS code, and ignore it until somebody actually
complains about it, and then aim to try to do something more targeted
for that particular complaint?

 Linus


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-07 Thread Myklebust, Trond
On Thu, 2013-03-07 at 09:03 -0800, Linus Torvalds wrote:
> On Thu, Mar 7, 2013 at 8:45 AM, Myklebust, Trond
> <trond.mykleb...@netapp.com> wrote:
> >
> > The problem there is that we get into the whole 'hard' vs 'soft' mount
> > problem. We're supposed to guarantee data integrity for 'hard' mounts,
> > so no funny business is allowed. OTOH, 'soft' mounts time out and return
> > EIO to the application anyway, and so shouldn't be a problem.
> >
> > Perhaps we could add a '-oslushy' mount option :-) that guarantees data
> > integrity for all situations _except_ ENETDOWN/ENETUNREACH?
>
> I do think we are probably over-analyzing this. It's not like people
> who want freezing to work usually use flaky NFS. There's really two
> main groups:
>
>  - the freezer as a snapshot mechanism that might use NFS because
> they are in a server environment.
>
>  - the freezer for suspend/resume on a laptop
>
> The first one does use NFS, and cares about it, and probably would
> prefer the freeze event to take longer and finish for all ongoing IO
> operations. End result: just ditch the freezable_schedule()
> entirely.
>
> The second one is unlikely to really use NFS anyway. End result:
> ditching the freezable_schedule() is probably perfectly fine, even if
> it would cause suspend failures if the network is being troublesome.
>
> So for now, why not just replace freezable_schedule() with plain
> schedule() in the NFS code, and ignore it until somebody actually
> complains about it, and then aim to try to do something more targeted
> for that particular complaint?

We _have_ had complaints about the laptop suspension problem; that was
why Jeff introduced freezable_schedule() in the first place. We've never
had complaints about any problems involving cgroup_freeze. This is why
our focus tends to be on the former, and why I'm more worried about
laptop suspend regressions for any short term fixes.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-07 Thread Rafael J. Wysocki
On Thursday, March 07, 2013 08:25:10 AM Linus Torvalds wrote:
> On Thu, Mar 7, 2013 at 7:59 AM, Myklebust, Trond
> <trond.mykleb...@netapp.com> wrote:
> >
> > It _shouldn't_ be an interruption unless the filesystem can't make
> > progress.
>
> So how can we tell? Calling freezable_schedule() if you're not ready
> to be frozen is not good. And nobody but the NFS code can know.
>
> You might want to introduce some counter that counts number of
> outstanding non-interruptible events, and only call the freezable
> version if that counter is zero.
>
> A better alternative might be to *never* call the freezable version.
> Because those freezable_*() things are really quite disgusting, and
> are wrong - they don't actually freeze the process, they say "I don't
> care if you freeze me while I sleep", and you might actually wake up
> *while* the system is being frozen. I think the whole concept is
> broken. Rafael - comments?

Well, the only comment I have is that the source of these functions was
commit d310310c ("Freezer / sunrpc / NFS: don't allow TASK_KILLABLE sleeps
to block the freezer") and they were added because people were complaining
that they couldn't suspend while their NFS servers were not responding, IIRC.

I agree that they are not really nice.

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-07 Thread Jeff Layton
On Thu, 7 Mar 2013 17:16:12 +
Myklebust, Trond <trond.mykleb...@netapp.com> wrote:

> On Thu, 2013-03-07 at 09:03 -0800, Linus Torvalds wrote:
> > On Thu, Mar 7, 2013 at 8:45 AM, Myklebust, Trond
> > <trond.mykleb...@netapp.com> wrote:
> > >
> > > The problem there is that we get into the whole 'hard' vs 'soft' mount
> > > problem. We're supposed to guarantee data integrity for 'hard' mounts,
> > > so no funny business is allowed. OTOH, 'soft' mounts time out and return
> > > EIO to the application anyway, and so shouldn't be a problem.
> > >
> > > Perhaps we could add a '-oslushy' mount option :-) that guarantees data
> > > integrity for all situations _except_ ENETDOWN/ENETUNREACH?
> >
> > I do think we are probably over-analyzing this. It's not like people
> > who want freezing to work usually use flaky NFS. There's really two
> > main groups:
> >
> >  - the freezer as a snapshot mechanism that might use NFS because
> > they are in a server environment.
> >
> >  - the freezer for suspend/resume on a laptop
> >
> > The first one does use NFS, and cares about it, and probably would
> > prefer the freeze event to take longer and finish for all ongoing IO
> > operations. End result: just ditch the freezable_schedule()
> > entirely.
> >
> > The second one is unlikely to really use NFS anyway. End result:
> > ditching the freezable_schedule() is probably perfectly fine, even if
> > it would cause suspend failures if the network is being troublesome.
> >
> > So for now, why not just replace freezable_schedule() with plain
> > schedule() in the NFS code, and ignore it until somebody actually
> > complains about it, and then aim to try to do something more targeted
> > for that particular complaint?
>
> We _have_ had complaints about the laptop suspension problem; that was
> why Jeff introduced freezable_schedule() in the first place. We've never
> had complaints about any problems involving cgroup_freeze. This is why
> our focus tends to be on the former, and why I'm more worried about
> laptop suspend regressions for any short term fixes.

Right. I don't know of anyone using the cgroup_freeze stuff. OTOH, I've
seen *many* reports of people who complained because their machine
didn't suspend because of a busy NFS mount.

In fact, the problem is still not fully solved with the
freezable_schedule stuff we have so far anyway. Even after this
(admittedly crappy) solution went in, we still had reports of machines
that failed to suspend because other tasks were stuck in
wait_on_page_writeback() for NFS pages.

So, in principle I think we need to abandon the current approach. The
question is what should replace it.

-- 
Jeff Layton jlay...@redhat.com


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Tejun Heo
On Wed, Mar 06, 2013 at 01:36:36PM -0800, Tejun Heo wrote:
> On Wed, Mar 06, 2013 at 01:31:10PM -0800, Linus Torvalds wrote:
> > So I do agree that we probably have *too* many of the stupid "let's
> > check if we can freeze", and I suspect that the NFS code should get
> > rid of the "freezable_schedule()" that is causing this warning
> > (because I also agree that you should *not* freeze while holding
> > locks, because it really can cause deadlocks), but I do suspect that
> > network filesystems do need to have a few places where they check for
> > freezing on their own... Exactly because freezing isn't *quite* like a
> > signal.
> 
> Well, I don't really know much about nfs so I can't really tell, but
> for most other cases, dealing with freezing like a signal should work
> fine from what I've seen although I can't be sure before actually
> trying.  Trond, Bruce, can you guys please chime in?

So, I think the question here would be, in nfs, how many of the
current freezer check points would be difficult to convert to signal
handling model after excluding the ones which are performed while
holding some locks which we need to get rid of anyway?

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Tejun Heo
On Wed, Mar 06, 2013 at 01:31:10PM -0800, Linus Torvalds wrote:
> So I do agree that we probably have *too* many of the stupid "let's
> check if we can freeze", and I suspect that the NFS code should get
> rid of the "freezable_schedule()" that is causing this warning
> (because I also agree that you should *not* freeze while holding
> locks, because it really can cause deadlocks), but I do suspect that
> network filesystems do need to have a few places where they check for
> freezing on their own... Exactly because freezing isn't *quite* like a
> signal.

Well, I don't really know much about nfs so I can't really tell, but
for most other cases, dealing with freezing like a signal should work
fine from what I've seen although I can't be sure before actually
trying.  Trond, Bruce, can you guys please chime in?

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Linus Torvalds
On Wed, Mar 6, 2013 at 1:24 PM, Tejun Heo  wrote:
>
> With syscall paths out of the way, the surface is reduced a lot.

So the issue is syscalls that don't react to signals, and that can
potentially wait a long time.

Like NFS with a network hiccup. Which is not at all unlikely. Think
wireless network, somebody trying to get on a network share, things
not working, and closing the damn lid because you give up.

So I do agree that we probably have *too* many of the stupid "let's
check if we can freeze", and I suspect that the NFS code should get
rid of the "freezable_schedule()" that is causing this warning
(because I also agree that you should *not* freeze while holding
locks, because it really can cause deadlocks), but I do suspect that
network filesystems do need to have a few places where they check for
freezing on their own... Exactly because freezing isn't *quite* like a
signal.

 Linus


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Tejun Heo
Hello, Linus.

On Wed, Mar 06, 2013 at 01:00:02PM -0800, Linus Torvalds wrote:
> > Oh yeah, we don't need another signal.  We just need sigpending state
> > and a wakeup.  I wasn't really going into details.  The important
> > point is that for code paths outside signal/ptrace, freezing could
> > look and behave about the same as signal delivery.
> 
> Don't we already do that? The whole "try_to_freeze()" in
> get_signal_to_deliver() is about exactly this. See
> fake_signal_wake_up().

Yeap, that was what I had in mind too.  Maybe we'll need to modify it
slightly but we already have most of the basic stuff.

> You still have kernel threads (that don't do signals) to worry about,
> so it doesn't make things go away. And you still have issues with
> latency of disk wait, which is, I think, the reason for that
> "freezable_schedule()" in the NFS code to begin with.

I haven't thought about it for quite some time so things are hazy, but
here's what I can recall now.

With syscall paths out of the way, the surface is reduced a lot.
Another part is converting most freezable kthread users to freezable
workqueue which provides natural resource boundaries (the duration of
work item execution).  kthread synchronization is already difficult to
get completely right, and a significant number of freezable + should_stop
users were subtly broken the last time I went over the freezer users.  I
think we would be much better off converting most over to freezable
workqueues, which are easier to get right and likely
to be less expensive.  Freezing happens at work item boundary which in
most cases could be made to coincide with the original freezer check
point.

There could be kthreads which can't be converted to workqueue for
whatever reason (there shouldn't be many at this point) but most
freezer usages in kthreads are pretty simple.  It's usually single or
a couple freezer check points in the main loop.  While we may still
need special handling for them, I don't think they're likely to have
implications on issues like this.
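
To make the conversion target concrete, the usual shape is something like
the following non-buildable, kernel-style sketch; set_freezable(),
try_to_freeze() and kthread_should_stop() are the real freezer/kthread
APIs, while the work step is a placeholder:

```c
/* Sketch only (kernel-style pseudocode, not buildable stand-alone).
 * One freezer check point per main-loop iteration, as described above. */
static int example_thread(void *data)
{
	set_freezable();		/* opt in to the freezer */

	while (!kthread_should_stop()) {
		try_to_freeze();	/* the freezer check point */
		do_one_unit_of_work(data);	/* placeholder */
		schedule_timeout_interruptible(HZ);
	}
	return 0;
}
```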

We probably would want to handle restart for freezable kthreads
calling syscalls.  Haven't thought about this one too much yet.  Maybe
freezable kthreads doing syscalls just need to be ready for
-ERESTARTSYS?

I'm not sure I follow the disk wait latency part.  Are you saying that
switching to jobctl trap based freezer implementation wouldn't help
them?  If so, right, it doesn't in itself.  It's just changing the
infrastructure used for freezing and can't make the underlying
synchronization issues just disappear but at least it becomes the same
problem as being responsive to SIGKILL rather than a completely
separate problem.

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Linus Torvalds
On Wed, Mar 6, 2013 at 10:53 AM, Tejun Heo  wrote:
> Hello, Oleg.
>
> On Wed, Mar 06, 2013 at 07:16:08PM +0100, Oleg Nesterov wrote:
>> And how SIGFREEZE can help? If we want to interrupt the sleeps in NFS/RPC
>> layer we can simply add TASK_WAKEFREEZE (can be used with TASK_KILLABLE)
>> and change freeze_task() to do signal_wake_up_state(TASK_WAKEFREEZE).
>
> Oh yeah, we don't need another signal.  We just need sigpending state
> and a wakeup.  I wasn't really going into details.  The important
> point is that for code paths outside signal/ptrace, freezing could
> look and behave about the same as signal delivery.

Don't we already do that? The whole "try_to_freeze()" in
get_signal_to_deliver() is about exactly this. See
fake_signal_wake_up().
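
For reference, that mechanism is tiny; roughly (paraphrasing
kernel/freezer.c of that era, details vary by version), freeze_task()
pokes user tasks with a fake signal wakeup so they cycle through the
signal delivery path and hit its try_to_freeze():

```c
/* Paraphrased sketch of kernel/freezer.c around v3.9; not a verbatim
 * copy. No signal is actually queued; the wakeup only makes the task
 * pass through get_signal_to_deliver(), which calls try_to_freeze(). */
static void fake_signal_wake_up(struct task_struct *p)
{
	unsigned long flags;

	if (lock_task_sighand(p, &flags)) {
		signal_wake_up(p, 0);
		unlock_task_sighand(p, &flags);
	}
}
```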

You still have kernel threads (that don't do signals) to worry about,
so it doesn't make things go away. And you still have issues with
latency of disk wait, which is, I think, the reason for that
"freezable_schedule()" in the NFS code to begin with.

   Linus


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Mandeep Singh Baines
On Wed, Mar 6, 2013 at 10:37 AM, Myklebust, Trond
 wrote:
> On Wed, 2013-03-06 at 13:23 -0500, Jeff Layton wrote:
>> On Wed, 6 Mar 2013 07:59:01 -0800
>> Mandeep Singh Baines  wrote:
>> > In general, holding a lock and freezing can cause a deadlock if:
>> >
>> > 1) you froze via the cgroup_freezer subsystem and a task in another
>> > cgroup tried to acquire the same lock
>> > 2) the lock was needed later in suspend/hibernate. For example, if the
>> > lock was needed in dpm_suspend by one of the device callbacks. For
>> > hibernate, you also need to worry about any locks that need to be
>> > acquired in order to write to the swap device.
>> > 3) another freezing task blocked on this lock and held other locks
>> > needed later in suspend. If that task were skipped by the freezer, you
>> > would deadlock
>> >
>> > You will block/prevent suspend if:
>> >
>> > 4) another freezing task blocked on this lock and was unable to freeze
>> >
>> > I think 1) and 4) can happen for the NFS/RPC case. Case 1) requires
>> > cgroup freezer. Case 4) while not causing a deadlock could prevent
>> > your laptop/phone from sleeping and end up burning all your battery.
>> > If suspend is initiated via lid close you won't even know about the
>> > failure.
>> >
>>
>> We're aware of #4. That was the intent of adding try_to_freeze() into
>> this codepath in the first place. It's not a great solution for obvious
>> reasons, but we don't have another at the moment.
>>
>> For #1 I'm not sure what to do. I'm not that familiar with cgroups or how
>> the freezer works.
>>
>> The bottom line is that we have a choice -- we can either rip out this
>> new lockdep warning, or rip out the code that causes it to fire.
>>
>> If we rip out the warning we may miss some legit cases where we might
>> possibly have hit a deadlock.
>>
>> If we rip out the code that causes it to fire, then we exacerbate the
>> #4 problem above. That will effectively make it so that you can't
>> suspend the host whenever NFS is doing anything moderately active.
>
> #4 is probably the only case where we might want to freeze.
>
> Unless we're in a situation where the network is going down, we can
> usually always make progress with completing the RPC call and finishing
> the system call. So in the case of cgroup_freezer, we only care if the
> freezing cgroup also owns the network device (or net namespace) that NFS
> is using to talk to the server.
>
> As I said, the alternative is to notify NFS that the device is going
> down, and to give it a chance to quiesce itself before that happens.
> This is also the only way to ensure that processes which own locks on
> the server (e.g. posix file locks) have a chance to release them before
> being suspended.
>

So if I'm understanding correctly, the locks are held for a bounded time,
so under normal circumstances you don't really need to freeze here. But if
the task that is going to wake you were to freeze, then this code path
could block suspend if PF_FREEZER_SKIP were not set.

For this particular case (not in general), instead of calling
try_to_freeze(), what if you only cleared PF_FREEZER_SKIP? For case
#1, you wouldn't block suspend if you're stuck waiting. If you
received the event while processes were freezing, you'd just
try_to_freeze() some place else. Probably on the way back up from the
system call. For case #4, you'd eventually complete the rpc and then
freeze some place else. If you're stuck waiting for the bit because
the process that was going to wake you also got frozen, then the
container would be stuck in the THAWING state. Not catastrophic and
something the administrator could workaround by sequencing things
right.
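
In code, the proposed change is small; a sketch (illustrative only: the
helper name here is made up, while PF_FREEZER_SKIP is the real per-task
flag that freezer_count() normally clears):

```c
/* Illustrative sketch, not mainline code. Instead of a full
 * try_to_freeze() at this wait point, only stop advertising the task
 * as skippable, so a freeze in progress is no longer blocked on it. */
static inline void freezer_stop_skipping(void)	/* hypothetical name */
{
	current->flags &= ~PF_FREEZER_SKIP;
	/* the actual try_to_freeze() then happens somewhere else,
	 * e.g. on the way back up from the system call */
}
```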

Does this sound like a reasonable workaround?

Regards,
Mandeep

>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> trond.mykleb...@netapp.com
> www.netapp.com


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Tejun Heo
Hello, Oleg.

On Wed, Mar 06, 2013 at 07:16:08PM +0100, Oleg Nesterov wrote:
> And how SIGFREEZE can help? If we want to interrupt the sleeps in NFS/RPC
> layer we can simply add TASK_WAKEFREEZE (can be used with TASK_KILLABLE)
> and change freeze_task() to do signal_wake_up_state(TASK_WAKEFREEZE).

Oh yeah, we don't need another signal.  We just need sigpending state
and a wakeup.  I wasn't really going into details.  The important
point is that for code paths outside signal/ptrace, freezing could
look and behave about the same as signal delivery.

> But if we can do this, then it should be possible to simply make these
> sleeps TASK_INTERRUPTIBLE? But it seems that we can't, just because we
> can't always restart, so I am totally confused.

Of course, switching to another freezing mechanism doesn't make issues
like this go away in itself, but it would at least allow handling such
issues in easier to grasp way rather than dealing with this giant
pseudo lock and allows code paths outside jobctl to simply deal with
one issue - signal delivery, rather than having to deal with freezing
separately.  Also, while not completely identical, frozen state would
at least be an extension of jobctl trap and it would be possible to
allow things like killing frozen tasks or even ptracing them in a
well-defined manner for cgroup_freezer use cases.  As it currently
stands, there is no well-defined abstraction for frozen state which we
can expose in any way, which sucks.

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Tejun Heo
Hello,

On Wed, Mar 06, 2013 at 01:40:04PM -0500, Jeff Layton wrote:
> Though when I said that, it was before Tejun mentioned hooking this up
> to ptrace. I'll confess that I don't fully understand what he's
> proposing either though...

Oh, they're all just pretty closely related.  All signal and ptrace
traps run off get_signal_to_deliver().  It is still a bit hairy but in
much better shape after cleanups which accompanied PTRACE_SEIZE and I
don't really think it'll be too difficult to extend it so that
freezing is implemented as a type of jobctl trap.

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Jeff Layton
On Wed, 6 Mar 2013 19:17:35 +0100
Oleg Nesterov  wrote:

> On 03/05, Jeff Layton wrote:
> >
> > Anyone up for working out how to handle a freeze event on a process
> > that already has a pending signal, while it's being ptraced?
> 
> Could you explain the problem?
> 

Not very well. I was just saying that the signal/ptrace stuff is very
complicated already. Now we're looking at mixing in the freezer too,
which further increases the complexity.

Though when I said that, it was before Tejun mentioned hooking this up
to ptrace. I'll confess that I don't fully understand what he's
proposing either though...

-- 
Jeff Layton 


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Myklebust, Trond
On Wed, 2013-03-06 at 13:23 -0500, Jeff Layton wrote:
> On Wed, 6 Mar 2013 07:59:01 -0800
> Mandeep Singh Baines  wrote:
> > In general, holding a lock and freezing can cause a deadlock if:
> > 
> > 1) you froze via the cgroup_freezer subsystem and a task in another
> > cgroup tried to acquire the same lock
> > 2) the lock was needed later in suspend/hibernate. For example, if the
> > lock was needed in dpm_suspend by one of the device callbacks. For
> > hibernate, you also need to worry about any locks that need to be
> > acquired in order to write to the swap device.
> > 3) another freezing task blocked on this lock and held other locks
> > needed later in suspend. If that task were skipped by the freezer, you
> > would deadlock
> > 
> > You will block/prevent suspend if:
> > 
> > 4) another freezing task blocked on this lock and was unable to freeze
> > 
> > I think 1) and 4) can happen for the NFS/RPC case. Case 1) requires
> > cgroup freezer. Case 4) while not causing a deadlock could prevent
> > your laptop/phone from sleeping and end up burning all your battery.
> > If suspend is initiated via lid close you won't even know about the
> > failure.
> > 
> 
> We're aware of #4. That was the intent of adding try_to_freeze() into
> this codepath in the first place. It's not a great solution for obvious
> reasons, but we don't have another at the moment.
> 
> For #1 I'm not sure what to do. I'm not that familiar with cgroups or how
> the freezer works.
> 
> The bottom line is that we have a choice -- we can either rip out this
> new lockdep warning, or rip out the code that causes it to fire.
> 
> If we rip out the warning we may miss some legit cases where we might
> possibly have hit a deadlock.
> 
> If we rip out the code that causes it to fire, then we exacerbate the
> #4 problem above. That will effectively make it so that you can't
> suspend the host whenever NFS is doing anything moderately active.

#4 is probably the only case where we might want to freeze.

Unless we're in a situation where the network is going down, we can
usually always make progress with completing the RPC call and finishing
the system call. So in the case of cgroup_freezer, we only care if the
freezing cgroup also owns the network device (or net namespace) that NFS
is using to talk to the server.

As I said, the alternative is to notify NFS that the device is going
down, and to give it a chance to quiesce itself before that happens.
This is also the only way to ensure that processes which own locks on
the server (e.g. posix file locks) have a chance to release them before
being suspended.


-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Jeff Layton
On Wed, 6 Mar 2013 07:59:01 -0800
Mandeep Singh Baines  wrote:

> On Wed, Mar 6, 2013 at 4:06 AM, Jeff Layton  wrote:
> > On Wed, 6 Mar 2013 10:09:14 +0100
> > Ingo Molnar  wrote:
> >
> >>
> >> * Mandeep Singh Baines  wrote:
> >>
> >> > On Tue, Mar 5, 2013 at 5:16 PM, Tejun Heo  wrote:
> >> > > On Tue, Mar 05, 2013 at 08:05:07PM -0500, J. Bruce Fields wrote:
> >> > >> If it's really just a 2-line patch to try_to_freeze(), could it just be
> >> > >> carried out-of-tree by people that are specifically working on tracking
> >> > >> down these problems?
> >> > >>
> >> > >> But I don't have strong feelings about it--as long as it doesn't result
> >> > >> in the same known issues getting reported again and again
> >> > >
> >> > > Agreed, I don't think a Kconfig option is justified for this.  If this
> >> > > is really important, annotate broken paths so that it doesn't trigger
> >> > > spuriously; otherwise, please just remove it.
> >> > >
> >> >
> >> > Fair enough. Let's revert then. I'll rework to use a lockdep annotation.
> >> >
> >> > Maybe, add a new lockdep API:
> >> >
> >> > lockdep_set_held_during_freeze(lock);
> >> >
> >> > Then when we do the check, ignore any locks that set this bit.
> >> >
> >> > Ingo, does this seem like a reasonable design to you?
> >>
> >> Am I reading the discussion correctly that the new warnings show REAL
> >> potential deadlock scenarios, which can hit real users and can lock
> >> their box up in entirely real usage scenarios?
> >>
> >> If yes then guys we _really_ don't want to use lockdep annotation to
> >> _HIDE_ bugs.
> >> We typically use them to teach lockdep about things it does not know about.
> >>
> >> How about fixing the deadlocks instead?
> >>
> >
> > I do see how the freezer might fail to suspend certain tasks, but I
> > don't see the deadlock scenario here in the NFS/RPC case. Can someone
> > outline a situation where this might end up deadlocking? If not, then
> > I'd be inclined to say that while this may be a problem, the warning is
> > excessive...
> >
> 
> In general, holding a lock and freezing can cause a deadlock if:
> 
> 1) you froze via the cgroup_freezer subsystem and a task in another
> cgroup tried to acquire the same lock
> 2) the lock was needed later in suspend/hibernate. For example, if the
> lock was needed in dpm_suspend by one of the device callbacks. For
> hibernate, you also need to worry about any locks that need to be
> acquired in order to write to the swap device.
> 3) another freezing task blocked on this lock and held other locks
> needed later in suspend. If that task were skipped by the freezer, you
> would deadlock
> 
> You will block/prevent suspend if:
> 
> 4) another freezing task blocked on this lock and was unable to freeze
> 
> I think 1) and 4) can happen for the NFS/RPC case. Case 1) requires
> cgroup freezer. Case 4) while not causing a deadlock could prevent
> your laptop/phone from sleeping and end up burning all your battery.
> If suspend is initiated via lid close you won't even know about the
> failure.
> 

We're aware of #4. That was the intent of adding try_to_freeze() into
this codepath in the first place. It's not a great solution for obvious
reasons, but we don't have another at the moment.

For #1 I'm not sure what to do. I'm not that familiar with cgroups or how
the freezer works.

The bottom line is that we have a choice -- we can either rip out this
new lockdep warning, or rip out the code that causes it to fire.

If we rip out the warning we may miss some legit cases where we might
possibly have hit a deadlock.

If we rip out the code that causes it to fire, then we exacerbate the
#4 problem above. That will effectively make it so that you can't
suspend the host whenever NFS is doing anything moderately active.

Ripping out the warning seems like the best course of action in the
near term, but it's not my call...

-- 
Jeff Layton 


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Oleg Nesterov
On 03/05, Jeff Layton wrote:
>
> Anyone up for working out how to handle a freeze event on a process
> that already has a pending signal, while it's being ptraced?

Could you explain the problem?

Oleg.



Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Oleg Nesterov
On 03/05, Tejun Heo wrote:
>
> Oleg, are you still opposed to the idea of making freezer share trap
> points with ptrace?

My memory can fool me, but iirc I wasn't actually opposed... I guess
you mean the previous discussion about vfork/ptrace/etc which I forgot
completely.

But I can recall the main problem with your idea (with me): I simply
wasn't able to understand it ;)

Likewise, I can't really understand the ideas discussed in this thread.
At least when it comes to this particular problem, rpc_wait_bit_killable()
is not interruptible...

And how SIGFREEZE can help? If we want to interrupt the sleeps in NFS/RPC
layer we can simply add TASK_WAKEFREEZE (can be used with TASK_KILLABLE)
and change freeze_task() to do signal_wake_up_state(TASK_WAKEFREEZE).
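
A sketch of that suggestion (TASK_WAKEFREEZE is hypothetical and the bit
value arbitrary; signal_wake_up_state() and TASK_KILLABLE are the real
interfaces):

```c
/* Illustrative sketch of the suggestion above; TASK_WAKEFREEZE does
 * not exist in mainline, and the bit value here is arbitrary. */
#define TASK_WAKEFREEZE		0x2000	/* hypothetical state bit */

/* Sleeps that can cope with being woken for freezing would opt in by
 * sleeping in state (TASK_KILLABLE | TASK_WAKEFREEZE), and
 * freeze_task() would then wake only those sleepers: */
static void freeze_task_wake(struct task_struct *p)
{
	signal_wake_up_state(p, TASK_WAKEFREEZE);
}
```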

But if we can do this, then it should be possible to simply make these
sleeps TASK_INTERRUPTIBLE? But it seems that we can't, just because we
can't always restart, so I am totally confused.

Oleg.



Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Mandeep Singh Baines
On Wed, Mar 6, 2013 at 4:06 AM, Jeff Layton  wrote:
> On Wed, 6 Mar 2013 10:09:14 +0100
> Ingo Molnar  wrote:
>
>>
>> * Mandeep Singh Baines  wrote:
>>
>> > On Tue, Mar 5, 2013 at 5:16 PM, Tejun Heo  wrote:
>> > > On Tue, Mar 05, 2013 at 08:05:07PM -0500, J. Bruce Fields wrote:
>> > >> If it's really just a 2-line patch to try_to_freeze(), could it just be
>> > >> carried out-of-tree by people that are specifically working on tracking
>> > >> down these problems?
>> > >>
>> > >> But I don't have strong feelings about it--as long as it doesn't result
>> > >> in the same known issues getting reported again and again
>> > >
>> > > Agreed, I don't think a Kconfig option is justified for this.  If this
>> > > is really important, annotate broken paths so that it doesn't trigger
>> > > spuriously; otherwise, please just remove it.
>> > >
>> >
>> > Fair enough. Let's revert then. I'll rework to use a lockdep annotation.
>> >
>> > Maybe, add a new lockdep API:
>> >
>> > lockdep_set_held_during_freeze(lock);
>> >
>> > Then when we do the check, ignore any locks that set this bit.
>> >
>> > Ingo, does this seem like a reasonable design to you?
>>
>> Am I reading the discussion correctly that the new warnings show REAL
>> potential deadlock scenarios, which can hit real users and can lock their
>> box up in entirely real usage scenarios?
>>
>> If yes then guys we _really_ don't want to use lockdep annotation to
>> _HIDE_ bugs.
>> We typically use them to teach lockdep about things it does not know about.
>>
>> How about fixing the deadlocks instead?
>>
>
> I do see how the freezer might fail to suspend certain tasks, but I
> don't see the deadlock scenario here in the NFS/RPC case. Can someone
> outline a situation where this might end up deadlocking? If not, then
> I'd be inclined to say that while this may be a problem, the warning is
> excessive...
>

In general, holding a lock and freezing can cause a deadlock if:

1) you froze via the cgroup_freezer subsystem and a task in another
cgroup tried to acquire the same lock
2) the lock was needed later in suspend/hibernate. For example, if the
lock was needed in dpm_suspend by one of the device callbacks. For
hibernate, you also need to worry about any locks that need to be
acquired in order to write to the swap device.
3) another freezing task blocked on this lock and held other locks
needed later in suspend. If that task were skipped by the freezer, you
would deadlock

You will block/prevent suspend if:

4) another freezing task blocked on this lock and was unable to freeze

I think 1) and 4) can happen for the NFS/RPC case. Case 1) requires
cgroup freezer. Case 4) while not causing a deadlock could prevent
your laptop/phone from sleeping and end up burning all your battery.
If suspend is initiated via lid close you won't even know about the
failure.

Regards,
Mandeep

> --
> Jeff Layton 


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Jeff Layton
On Wed, 6 Mar 2013 10:09:14 +0100
Ingo Molnar  wrote:

> 
> * Mandeep Singh Baines  wrote:
> 
> > On Tue, Mar 5, 2013 at 5:16 PM, Tejun Heo  wrote:
> > > On Tue, Mar 05, 2013 at 08:05:07PM -0500, J. Bruce Fields wrote:
> > >> If it's really just a 2-line patch to try_to_freeze(), could it just be
> > >> carried out-of-tree by people that are specifically working on tracking
> > >> down these problems?
> > >>
> > >> But I don't have strong feelings about it--as long as it doesn't result
> > >> in the same known issues getting reported again and again
> > >
> > > Agreed, I don't think a Kconfig option is justified for this.  If this
> > > is really important, annotate broken paths so that it doesn't trigger
> > > spuriously; otherwise, please just remove it.
> > >
> > 
> > Fair enough. Let's revert then. I'll rework to use a lockdep annotation.
> > 
> > Maybe, add a new lockdep API:
> > 
> > lockdep_set_held_during_freeze(lock);
> > 
> > Then when we do the check, ignore any locks that set this bit.
> > 
> > Ingo, does this seem like a reasonable design to you?
> 
> Am I reading the discussion correctly that the new warnings show REAL
> potential deadlock scenarios, which can hit real users and can lock their
> box up in entirely real usage scenarios?
>
> If yes then guys we _really_ don't want to use lockdep annotation to
> _HIDE_ bugs.
> We typically use them to teach lockdep about things it does not know about.
> 
> How about fixing the deadlocks instead?
> 

I do see how the freezer might fail to suspend certain tasks, but I
don't see the deadlock scenario here in the NFS/RPC case. Can someone
outline a situation where this might end up deadlocking? If not, then
I'd be inclined to say that while this may be a problem, the warning is
excessive...

-- 
Jeff Layton 


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Jeff Layton
On Wed, 6 Mar 2013 01:10:07 +
"Myklebust, Trond"  wrote:

> On Tue, 2013-03-05 at 14:03 -0500, Jeff Layton wrote:
> > On Tue, 5 Mar 2013 09:49:54 -0800
> > Tejun Heo  wrote:
> > 
> > > On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
> > > > So, I think this is why implementing freezer as a separate blocking
> > > > mechanism isn't such a good idea.  We're effectively introducing a
> > > > completely new waiting state to a lot of unsuspecting paths which
> > > > generates a lot of risks and eventually extra complexity to work
> > > > around those.  I think we really should update freezer to re-use the
> > > > blocking points we already have - the ones used for signal delivery
> > > > and ptracing.  That way, other code paths don't have to worry about an
> > > > extra stop state and we can confine most complexities to freezer
> > > > proper.
> > > 
> > > Also, consolidating those wait states means that we can solve the
> > > event-to-response latency problem for all three cases - signal, ptrace
> > > and freezer, rather than adding separate backing-out strategy for
> > > freezer.
> > > 
> > 
> > Sounds intriguing...
> > 
> > I'm not sure what this really means for something like NFS though. How
> > would you envision this working when we have long running syscalls that
> > might sit waiting in the kernel indefinitely?
> > 
> > Here's my blue-sky, poorly-thought-out idea...
> > 
> > We could add a signal (e.g. SIGFREEZE) that allows the sleeps in
> > NFS/RPC layer to be interrupted. Those would return back toward
> > userland with a particular type of error (sort of like ERESTARTSYS).
> > 
> > Before returning from the kernel though, we could freeze the process.
> > When it wakes up, then we could go back down and retry the call again
> > (much like an ERESTARTSYS kind of thing).
> 
> Two (three?) show stopper words for you: "non-idempotent operations".
> 
> Not all RPC calls can just be interrupted and restarted. Something like
> an exclusive file create, unlink, file locking attempt, etc may give
> rise to different results when replayed in the above scenario.
> Interrupting an RPC call is not equivalent to cancelling its effects...
> 

Right -- that's the part where we have to take great care to save the
state of the syscall at the time we returned back up toward userland on
a freeze event. I suppose we'd need to hang something off the
task_struct to keep track of that.

In most cases, it would be sufficient to keep track of whether an RPC
had been sent during the call for non-idempotent operations. If it was
sent, then we'd just re-enter the wait for the reply. If it wasn't then
we'd go ahead and send the call.
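
Hypothetically, the state hung off the task_struct could be as small as
this (a sketch; every name here is made up and nothing like it exists in
the kernel):

```c
/* Purely hypothetical sketch of the per-syscall restart state
 * discussed above; none of these identifiers exist in the kernel. */
struct nfs_freeze_restart {
	bool	rpc_sent;	/* RPC already transmitted before freeze? */
	/* ...enough extra state to re-enter the wait for the reply... */
};

/* On thaw, the restarted syscall would roughly do:
 *	if (state->rpc_sent)
 *		re-enter the wait for the reply;
 *	else
 *		(re)send the call;
 */
```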

Still, I'm sure there are details I'm overlooking here. The whole point
of holding these mutexes is to ensure that the directories don't change
while we're doing these operations. If we
release the locks in order to go to sleep, then there's no guarantee
that things haven't changed when we reacquire them.

Maybe it's best to give up and just tell people that suspending your
laptop with a NFS mount is not allowed :P

-- 
Jeff Layton 


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Ingo Molnar

* Mandeep Singh Baines  wrote:

> On Tue, Mar 5, 2013 at 5:16 PM, Tejun Heo  wrote:
> > On Tue, Mar 05, 2013 at 08:05:07PM -0500, J. Bruce Fields wrote:
> >> If it's really just a 2-line patch to try_to_freeze(), could it just be
> >> carried out-of-tree by people that are specifically working on tracking
> >> down these problems?
> >>
> >> But I don't have strong feelings about it--as long as it doesn't result
> >> in the same known issues getting reported again and again
> >
> > Agreed, I don't think a Kconfig option is justified for this.  If this
> > is really important, annotate broken paths so that it doesn't trigger
> > spuriously; otherwise, please just remove it.
> >
> 
> Fair enough. Let's revert then. I'll rework to use a lockdep annotation.
> 
> Maybe, add a new lockdep API:
> 
> lockdep_set_held_during_freeze(lock);
> 
> Then when we do the check, ignore any locks that set this bit.
> 
> Ingo, does this seem like a reasonable design to you?

Am I reading the discussion correctly that the new warnings show REAL
potential deadlock scenarios, which can hit real users and can lock their
box up in entirely real usage scenarios?

If yes then guys we _really_ don't want to use lockdep annotation to _HIDE_
bugs.
We typically use them to teach lockdep about things it does not know about.

How about fixing the deadlocks instead?

Thanks,

Ingo


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Jeff Layton
On Wed, 6 Mar 2013 01:10:07 +
Myklebust, Trond trond.mykleb...@netapp.com wrote:

 On Tue, 2013-03-05 at 14:03 -0500, Jeff Layton wrote:
  On Tue, 5 Mar 2013 09:49:54 -0800
  Tejun Heo t...@kernel.org wrote:
  
   On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
So, I think this is why implementing freezer as a separate blocking
mechanism isn't such a good idea.  We're effectively introducing a
completely new waiting state to a lot of unsuspecting paths which
generates a lot of risks and eventually extra complexity to work
around those.  I think we really should update freezer to re-use the
blocking points we already have - the ones used for signal delivery
and ptracing.  That way, other code paths don't have to worry about an
extra stop state and we can confine most complexities to freezer
proper.
   
   Also, consolidating those wait states means that we can solve the
   event-to-response latency problem for all three cases - signal, ptrace
   and freezer, rather than adding separate backing-out strategy for
   freezer.
   
  
  Sounds intriguing...
  
  I'm not sure what this really means for something like NFS though. How
  would you envision this working when we have long running syscalls that
  might sit waiting in the kernel indefinitely?
  
  Here's my blue-sky, poorly-thought-out idea...
  
  We could add a signal (e.g. SIGFREEZE) that allows the sleeps in
  NFS/RPC layer to be interrupted. Those would return back toward
  userland with a particular type of error (sort of like ERESTARTSYS).
  
  Before returning from the kernel though, we could freeze the process.
  When it wakes up, then we could go back down and retry the call again
  (much like an ERESTARTSYS kind of thing).
 
 Two (three?) show stopper words for you: non-idempotent operations.
 
 Not all RPC calls can just be interrupted and restarted. Something like
 an exclusive file create, unlink, file locking attempt, etc may give
 rise to different results when replayed in the above scenario.
 Interrupting an RPC call is not equivalent to cancelling its effects...
 

Right -- that's the part where we have to take great care to save the
state of the syscall at the time we returned back up toward userland on
a freeze event. I suppose we'd need to hang something off the
task_struct to keep track of that.

In most cases, it would be sufficient to keep track of whether an RPC
had been sent during the call for non-idempotent operations. If it was
sent, then we'd just re-enter the wait for the reply. If it wasn't then
we'd go ahead and send the call.

Still, I'm sure there are details I'm overlooking here. The whole point
of holding these mutexes is to ensure that the directories don't
change while we're doing these operations. If we
release the locks in order to go to sleep, then there's no guarantee
that things haven't changed when we reacquire them.

Maybe it's best to give up and just tell people that suspending your
laptop with an NFS mount is not allowed :P

-- 
Jeff Layton jlay...@redhat.com


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Jeff Layton
On Wed, 6 Mar 2013 10:09:14 +0100
Ingo Molnar mi...@kernel.org wrote:

 
 * Mandeep Singh Baines m...@chromium.org wrote:
 
  On Tue, Mar 5, 2013 at 5:16 PM, Tejun Heo t...@kernel.org wrote:
   On Tue, Mar 05, 2013 at 08:05:07PM -0500, J. Bruce Fields wrote:
   If it's really just a 2-line patch to try_to_freeze(), could it just be
   carried out-of-tree by people that are specifically working on tracking
   down these problems?
  
   But I don't have strong feelings about it--as long as it doesn't result
   in the same known issues getting reported again and again
  
   Agreed, I don't think a Kconfig option is justified for this.  If this
   is really important, annotate broken paths so that it doesn't trigger
   spuriously; otherwise, please just remove it.
  
  
  Fair enough. Let's revert then. I'll rework to use a lockdep annotation.
  
  Maybe, add a new lockdep API:
  
  lockdep_set_held_during_freeze(lock);
  
  Then when we do the check, ignore any locks that set this bit.
  
  Ingo, does this seem like a reasonable design to you?
 
 Am I reading the discussion correctly that the new warnings show REAL 
 potential 
 deadlock scenarios, which can hit real users and can lock their box up in 
 entirely 
 real usage scenarios?
 
 If yes then guys we _really_ don't want to use lockdep annotation to _HIDE_ 
 bugs.
 We typically use them to teach lockdep about things it does not know about.
 
 How about fixing the deadlocks instead?
 

I do see how the freezer might fail to suspend certain tasks, but I
don't see the deadlock scenario here in the NFS/RPC case. Can someone
outline a situation where this might end up deadlocking? If not, then
I'd be inclined to say that while this may be a problem, the warning is
excessive...

-- 
Jeff Layton jlay...@redhat.com


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Mandeep Singh Baines
On Wed, Mar 6, 2013 at 4:06 AM, Jeff Layton jlay...@redhat.com wrote:
 On Wed, 6 Mar 2013 10:09:14 +0100
 Ingo Molnar mi...@kernel.org wrote:


 * Mandeep Singh Baines m...@chromium.org wrote:

  On Tue, Mar 5, 2013 at 5:16 PM, Tejun Heo t...@kernel.org wrote:
   On Tue, Mar 05, 2013 at 08:05:07PM -0500, J. Bruce Fields wrote:
   If it's really just a 2-line patch to try_to_freeze(), could it just be
   carried out-of-tree by people that are specifically working on tracking
   down these problems?
  
   But I don't have strong feelings about it--as long as it doesn't result
   in the same known issues getting reported again and again
  
   Agreed, I don't think a Kconfig option is justified for this.  If this
   is really important, annotate broken paths so that it doesn't trigger
   spuriously; otherwise, please just remove it.
  
 
  Fair enough. Let's revert then. I'll rework to use a lockdep annotation.
 
  Maybe, add a new lockdep API:
 
  lockdep_set_held_during_freeze(lock);
 
  Then when we do the check, ignore any locks that set this bit.
 
  Ingo, does this seem like a reasonable design to you?

 Am I reading the discussion correctly that the new warnings show REAL 
 potential
 deadlock scenarios, which can hit real users and can lock their box up in 
 entirely
 real usage scenarios?

 If yes then guys we _really_ don't want to use lockdep annotation to _HIDE_ 
 bugs.
 We typically use them to teach lockdep about things it does not know about.

 How about fixing the deadlocks instead?


 I do see how the freezer might fail to suspend certain tasks, but I
 don't see the deadlock scenario here in the NFS/RPC case. Can someone
 outline a situation where this might end up deadlocking? If not, then
 I'd be inclined to say that while this may be a problem, the warning is
 excessive...


In general, holding a lock and freezing can cause a deadlock if:

1) you froze via the cgroup_freezer subsystem and a task in another
cgroup tried to acquire the same lock
2) the lock was needed later in suspend/hibernate. For example, if the
lock was needed in dpm_suspend by one of the device callbacks. For
hibernate, you also need to worry about any locks that need to be
acquired in order to write to the swap device.
3) another freezing task blocked on this lock and held other locks
needed later in suspend. If that task were skipped by the freezer, you
would deadlock

You will block/prevent suspend if:

4) another freezing task blocked on this lock and was unable to freeze

I think 1) and 4) can happen for the NFS/RPC case. Case 1) requires
cgroup freezer. Case 4) while not causing a deadlock could prevent
your laptop/phone from sleeping and end up burning all your battery.
If suspend is initiated via lid close you won't even know about the
failure.

Regards,
Mandeep

 --
 Jeff Layton jlay...@redhat.com


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Oleg Nesterov
On 03/05, Tejun Heo wrote:

 Oleg, are you still opposed to the idea of making freezer share trap
 points with ptrace?

My memory can fool me, but iirc I wasn't actually opposed... I guess
you mean the previous discussion about vfork/ptrace/etc which I forgot
completely.

But I can recall the main problem with your idea (with me): I simply
wasn't able to understand it ;)

Likewise, I can't really understand the ideas discussed in this thread.
At least when it comes to this particular problem, rpc_wait_bit_killable()
is not interruptible...

And how can SIGFREEZE help? If we want to interrupt the sleeps in the
NFS/RPC layer, we can simply add TASK_WAKEFREEZE (can be used with TASK_KILLABLE)
and change freeze_task() to do signal_wake_up_state(TASK_WAKEFREEZE).

But if we can do this, then it should be possible to simply make these
sleeps TASK_INTERRUPTIBLE? But it seems that we can't, simply because we
can't always restart, so I am totally confused.

Oleg.



Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Oleg Nesterov
On 03/05, Jeff Layton wrote:

 Anyone up for working out how to handle a freeze event on a process
 that already has a pending signal, while it's being ptraced?

Could you explain the problem?

Oleg.



Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Jeff Layton
On Wed, 6 Mar 2013 07:59:01 -0800
Mandeep Singh Baines m...@chromium.org wrote:

 On Wed, Mar 6, 2013 at 4:06 AM, Jeff Layton jlay...@redhat.com wrote:
  On Wed, 6 Mar 2013 10:09:14 +0100
  Ingo Molnar mi...@kernel.org wrote:
 
 
  * Mandeep Singh Baines m...@chromium.org wrote:
 
   On Tue, Mar 5, 2013 at 5:16 PM, Tejun Heo t...@kernel.org wrote:
On Tue, Mar 05, 2013 at 08:05:07PM -0500, J. Bruce Fields wrote:
If it's really just a 2-line patch to try_to_freeze(), could it just 
be
carried out-of-tree by people that are specifically working on 
tracking
down these problems?
   
But I don't have strong feelings about it--as long as it doesn't 
result
in the same known issues getting reported again and again
   
Agreed, I don't think a Kconfig option is justified for this.  If this
is really important, annotate broken paths so that it doesn't trigger
spuriously; otherwise, please just remove it.
   
  
   Fair enough. Let's revert then. I'll rework to use a lockdep annotation.
  
   Maybe, add a new lockdep API:
  
   lockdep_set_held_during_freeze(lock);
  
   Then when we do the check, ignore any locks that set this bit.
  
   Ingo, does this seem like a reasonable design to you?
 
  Am I reading the discussion correctly that the new warnings show REAL 
  potential
  deadlock scenarios, which can hit real users and can lock their box up in 
  entirely
  real usage scenarios?
 
  If yes then guys we _really_ don't want to use lockdep annotation to 
  _HIDE_ bugs.
  We typically use them to teach lockdep about things it does not know about.
 
  How about fixing the deadlocks instead?
 
 
  I do see how the freezer might fail to suspend certain tasks, but I
  don't see the deadlock scenario here in the NFS/RPC case. Can someone
  outline a situation where this might end up deadlocking? If not, then
  I'd be inclined to say that while this may be a problem, the warning is
  excessive...
 
 
 In general, holding a lock and freezing can cause a deadlock if:
 
 1) you froze via the cgroup_freezer subsystem and a task in another
 cgroup tried to acquire the same lock
 2) the lock was needed later in suspend/hibernate. For example, if the
 lock was needed in dpm_suspend by one of the device callbacks. For
 hibernate, you also need to worry about any locks that need to be
 acquired in order to write to the swap device.
 3) another freezing task blocked on this lock and held other locks
 needed later in suspend. If that task were skipped by the freezer, you
 would deadlock
 
 You will block/prevent suspend if:
 
 4) another freezing task blocked on this lock and was unable to freeze
 
 I think 1) and 4) can happen for the NFS/RPC case. Case 1) requires
 cgroup freezer. Case 4) while not causing a deadlock could prevent
 your laptop/phone from sleeping and end up burning all your battery.
 If suspend is initiated via lid close you won't even know about the
 failure.
 

We're aware of #4. That was the intent of adding try_to_freeze() into
this codepath in the first place. It's not a great solution for obvious
reasons, but we don't have another at the moment.

For #1 I'm not sure what to do. I'm not that familiar with cgroups or how
the freezer works.

The bottom line is that we have a choice -- we can either rip out this
new lockdep warning, or rip out the code that causes it to fire.

If we rip out the warning we may miss some legit cases where we might
possibly have hit a deadlock.

If we rip out the code that causes it to fire, then we exacerbate the
#4 problem above. That will effectively make it so that you can't
suspend the host whenever NFS is doing anything moderately active.

Ripping out the warning seems like the best course of action in the
near term, but it's not my call...

-- 
Jeff Layton jlay...@redhat.com


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Myklebust, Trond
On Wed, 2013-03-06 at 13:23 -0500, Jeff Layton wrote:
 On Wed, 6 Mar 2013 07:59:01 -0800
 Mandeep Singh Baines m...@chromium.org wrote:
  In general, holding a lock and freezing can cause a deadlock if:
  
  1) you froze via the cgroup_freezer subsystem and a task in another
  cgroup tried to acquire the same lock
  2) the lock was needed later in suspend/hibernate. For example, if the
  lock was needed in dpm_suspend by one of the device callbacks. For
  hibernate, you also need to worry about any locks that need to be
  acquired in order to write to the swap device.
  3) another freezing task blocked on this lock and held other locks
  needed later in suspend. If that task were skipped by the freezer, you
  would deadlock
  
  You will block/prevent suspend if:
  
  4) another freezing task blocked on this lock and was unable to freeze
  
  I think 1) and 4) can happen for the NFS/RPC case. Case 1) requires
  cgroup freezer. Case 4) while not causing a deadlock could prevent
  your laptop/phone from sleeping and end up burning all your battery.
  If suspend is initiated via lid close you won't even know about the
  failure.
  
 
 We're aware of #4. That was the intent of adding try_to_freeze() into
 this codepath in the first place. It's not a great solution for obvious
 reasons, but we don't have another at the moment.
 
 For #1 I'm not sure what to do. I'm not that familiar with cgroups or how
 the freezer works.
 
 The bottom line is that we have a choice -- we can either rip out this
 new lockdep warning, or rip out the code that causes it to fire.
 
 If we rip out the warning we may miss some legit cases where we might
 possibly have hit a deadlock.
 
 If we rip out the code that causes it to fire, then we exacerbate the
 #4 problem above. That will effectively make it so that you can't
 suspend the host whenever NFS is doing anything moderately active.

#4 is probably the only case where we might want to freeze.

Unless we're in a situation where the network is going down, we can
usually always make progress with completing the RPC call and finishing
the system call. So in the case of cgroup_freezer, we only care if the
freezing cgroup also owns the network device (or net namespace) that NFS
is using to talk to the server.

As I said, the alternative is to notify NFS that the device is going
down, and to give it a chance to quiesce itself before that happens.
This is also the only way to ensure that processes which own locks on
the server (e.g. posix file locks) have a chance to release them before
being suspended.


-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Jeff Layton
On Wed, 6 Mar 2013 19:17:35 +0100
Oleg Nesterov o...@redhat.com wrote:

 On 03/05, Jeff Layton wrote:
 
  Anyone up for working out how to handle a freeze event on a process
  that already has a pending signal, while it's being ptraced?
 
 Could you explain the problem?
 

Not very well. I was just saying that the signal/ptrace stuff is very
complicated already. Now we're looking at mixing in the freezer too,
which further increases the complexity.

Though when I said that, it was before Tejun mentioned hooking this up
to ptrace. I'll confess that I don't fully understand what he's
proposing either though...

-- 
Jeff Layton jlay...@redhat.com


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Tejun Heo
Hello,

On Wed, Mar 06, 2013 at 01:40:04PM -0500, Jeff Layton wrote:
 Though when I said that, it was before Tejun mentioned hooking this up
 to ptrace. I'll confess that I don't fully understand what he's
 proposing either though...

Oh, they're all just pretty closely related.  All signal and ptrace
traps run off get_signal_to_deliver().  It is still a bit hairy but in
much better shape after cleanups which accompanied PTRACE_SEIZE and I
don't really think it'll be too difficult to extend it so that
freezing is implemented as a type of jobctl trap.

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Tejun Heo
Hello, Oleg.

On Wed, Mar 06, 2013 at 07:16:08PM +0100, Oleg Nesterov wrote:
 And how can SIGFREEZE help? If we want to interrupt the sleeps in the
 NFS/RPC layer, we can simply add TASK_WAKEFREEZE (can be used with TASK_KILLABLE)
 and change freeze_task() to do signal_wake_up_state(TASK_WAKEFREEZE).

Oh yeah, we don't need another signal.  We just need sigpending state
and a wakeup.  I wasn't really going into details.  The important
point is that for code paths outside signal/ptrace, freezing could
look and behave about the same as signal delivery.

 But if we can do this, then it should be possible to simply make these
 sleeps TASK_INTERRUPTIBLE? But it seems that we can't, simply because we
 can't always restart, so I am totally confused.

Of course, switching to another freezing mechanism doesn't make issues
like this go away in itself, but it would at least allow handling such
issues in easier to grasp way rather than dealing with this giant
pseudo lock and allows code paths outside jobctl to simply deal with
one issue - signal delivery, rather than having to deal with freezing
separately.  Also, while not completely identical, frozen state would
at least be an extension of jobctl trap and it would be possible to
allow things like killing frozen tasks or even ptracing them in
well-defined manner for cgroup_freezer use cases.  As it currently
stands, there is no well-defined abstraction for frozen state which we
can expose in any way, which sucks.

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Mandeep Singh Baines
On Wed, Mar 6, 2013 at 10:37 AM, Myklebust, Trond
trond.mykleb...@netapp.com wrote:
 On Wed, 2013-03-06 at 13:23 -0500, Jeff Layton wrote:
 On Wed, 6 Mar 2013 07:59:01 -0800
 Mandeep Singh Baines m...@chromium.org wrote:
  In general, holding a lock and freezing can cause a deadlock if:
 
  1) you froze via the cgroup_freezer subsystem and a task in another
  cgroup tried to acquire the same lock
  2) the lock was needed later in suspend/hibernate. For example, if the
  lock was needed in dpm_suspend by one of the device callbacks. For
  hibernate, you also need to worry about any locks that need to be
  acquired in order to write to the swap device.
  3) another freezing task blocked on this lock and held other locks
  needed later in suspend. If that task were skipped by the freezer, you
  would deadlock
 
  You will block/prevent suspend if:
 
  4) another freezing task blocked on this lock and was unable to freeze
 
  I think 1) and 4) can happen for the NFS/RPC case. Case 1) requires
  cgroup freezer. Case 4) while not causing a deadlock could prevent
  your laptop/phone from sleeping and end up burning all your battery.
  If suspend is initiated via lid close you won't even know about the
  failure.
 

 We're aware of #4. That was the intent of adding try_to_freeze() into
 this codepath in the first place. It's not a great solution for obvious
 reasons, but we don't have another at the moment.

 For #1 I'm not sure what to do. I'm not that familiar with cgroups or how
 the freezer works.

 The bottom line is that we have a choice -- we can either rip out this
 new lockdep warning, or rip out the code that causes it to fire.

 If we rip out the warning we may miss some legit cases where we might
 possibly have hit a deadlock.

 If we rip out the code that causes it to fire, then we exacerbate the
 #4 problem above. That will effectively make it so that you can't
 suspend the host whenever NFS is doing anything moderately active.

 #4 is probably the only case where we might want to freeze.

 Unless we're in a situation where the network is going down, we can
 usually always make progress with completing the RPC call and finishing
 the system call. So in the case of cgroup_freezer, we only care if the
 freezing cgroup also owns the network device (or net namespace) that NFS
 is using to talk to the server.

 As I said, the alternative is to notify NFS that the device is going
 down, and to give it a chance to quiesce itself before that happens.
 This is also the only way to ensure that processes which own locks on
 the server (e.g. posix file locks) have a chance to release them before
 being suspended.


So if I'm understanding correctly, the locks are held for a bounded time,
so under normal circumstances you don't really need to freeze here. But if
the task that is going to wake you were to freeze, then this code path
could block suspend if PF_FREEZER_SKIP were not set.

For this particular case (not in general), instead of calling
try_to_freeze(), what if you only cleared PF_FREEZER_SKIP? For case
#1, you wouldn't block suspend if you're stuck waiting. If you
received the event while processes were freezing, you'd just
try_to_freeze() some place else. Probably on the way back up from the
system call. For case #4, you'd eventually complete the rpc and then
freeze some place else. If you're stuck waiting for the bit because
the process that was going to wake you also got frozen, then the
container would be stuck in the THAWING state. Not catastrophic and
something the administrator could work around by sequencing things
right.

Does this sound like a reasonable workaround?

Regards,
Mandeep


 --
 Trond Myklebust
 Linux NFS client maintainer

 NetApp
 trond.mykleb...@netapp.com
 www.netapp.com


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Linus Torvalds
On Wed, Mar 6, 2013 at 10:53 AM, Tejun Heo t...@kernel.org wrote:
 Hello, Oleg.

 On Wed, Mar 06, 2013 at 07:16:08PM +0100, Oleg Nesterov wrote:
 And how SIGFREEZE can help? If we want to interrupt the sleeps in NFS/RPC
 layer we can simply add TASK_WAKEFREEZE (can be used with TASK_KILLABLE)
 and change freeze_task() to do signal_wake_up_state(TASK_WAKEFREEZE).

 Oh yeah, we don't need another signal.  We just need sigpending state
 and a wakeup.  I wasn't really going into details.  The important
 point is that for code paths outside signal/ptrace, freezing could
 look and behave about the same as signal delivery.

Don't we already do that? The whole try_to_freeze() in
get_signal_to_deliver() is about exactly this. See
fake_signal_wake_up().

You still have kernel threads (that don't do signals) to worry about,
so it doesn't make things go away. And you still have issues with
latency of disk wait, which is, I think, the reason for that
freezable_schedule() in the NFS code to begin with.

   Linus


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Tejun Heo
Hello, Linus.

On Wed, Mar 06, 2013 at 01:00:02PM -0800, Linus Torvalds wrote:
  Oh yeah, we don't need another signal.  We just need sigpending state
  and a wakeup.  I wasn't really going into details.  The important
  point is that for code paths outside signal/ptrace, freezing could
  look and behave about the same as signal delivery.
 
 Don't we already do that? The whole try_to_freeze() in
 get_signal_to_deliver() is about exactly this. See
 fake_signal_wake_up().

Yeap, that was what I had in mind too.  Maybe we'll need to modify it
slightly but we already have most of the basic stuff.

 You still have kernel threads (that don't do signals) to worry about,
 so it doesn't make things go away. And you still have issues with
 latency of disk wait, which is, I think, the reason for that
 freezable_schedule() in the NFS code to begin with.

I haven't thought about it for quite some time so things are hazy, but
here's what I can recall now.

With syscall paths out of the way, the surface is reduced a lot.
Another part is converting most freezable kthread users to freezable
workqueue which provides natural resource boundaries (the duration of
work item execution).  kthread is already difficult to get the
synchronization completely right, and a significant number of freezable +
should_stop users were subtly broken the last time I went over the
freezer users.  I think we would be much better off converting most
over to freezable workqueues which is easier to get right and likely
to be less expensive.  Freezing happens at work item boundary which in
most cases could be made to coincide with the original freezer check
point.

There could be kthreads which can't be converted to workqueue for
whatever reason (there shouldn't be many at this point) but most
freezer usages in kthreads are pretty simple.  It's usually single or
a couple freezer check points in the main loop.  While we may still
need special handling for them, I don't think they're likely to have
implications on issues like this.

We probably would want to handle restart for freezable kthreads
calling syscalls.  Haven't thought about this one too much yet.  Maybe
freezable kthreads doing syscalls just need to be ready for
-ERESTARTSYS?

I'm not sure I follow the disk wait latency part.  Are you saying that
switching to jobctl trap based freezer implementation wouldn't help
them?  If so, right, it doesn't in itself.  It's just changing the
infrastructure used for freezing and can't make the underlying
synchronization issues just disappear but at least it becomes the same
problem as being responsive to SIGKILL rather than a completely
separate problem.

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Linus Torvalds
On Wed, Mar 6, 2013 at 1:24 PM, Tejun Heo t...@kernel.org wrote:

 With syscall paths out of the way, the surface is reduced a lot.

So the issue is syscalls that don't react to signals, and that can
potentially wait a long time.

Like NFS with a network hiccup. Which is not at all unlikely. Think
wireless network, somebody trying to get on a network share, things
not working, and closing the damn lid because you give up.

So I do agree that we probably have *too* many of the stupid "let's
check if we can freeze", and I suspect that the NFS code should get
rid of the "freezable_schedule()" that is causing this warning
(because I also agree that you should *not* freeze while holding
locks, because it really can cause deadlocks), but I do suspect that
network filesystems do need to have a few places where they check for
freezing on their own... Exactly because freezing isn't *quite* like a
signal.

 Linus


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Tejun Heo
On Wed, Mar 06, 2013 at 01:31:10PM -0800, Linus Torvalds wrote:
> So I do agree that we probably have *too* many of the stupid "let's
> check if we can freeze", and I suspect that the NFS code should get
> rid of the "freezable_schedule()" that is causing this warning
> (because I also agree that you should *not* freeze while holding
> locks, because it really can cause deadlocks), but I do suspect that
> network filesystems do need to have a few places where they check for
> freezing on their own... Exactly because freezing isn't *quite* like a
> signal.

Well, I don't really know much about nfs so I can't really tell, but
for most other cases, dealing with freezing like a signal should work
fine from what I've seen although I can't be sure before actually
trying.  Trond, Bruce, can you guys please chime in?

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-06 Thread Tejun Heo
On Wed, Mar 06, 2013 at 01:36:36PM -0800, Tejun Heo wrote:
> On Wed, Mar 06, 2013 at 01:31:10PM -0800, Linus Torvalds wrote:
> > So I do agree that we probably have *too* many of the stupid "let's
> > check if we can freeze", and I suspect that the NFS code should get
> > rid of the "freezable_schedule()" that is causing this warning
> > (because I also agree that you should *not* freeze while holding
> > locks, because it really can cause deadlocks), but I do suspect that
> > network filesystems do need to have a few places where they check for
> > freezing on their own... Exactly because freezing isn't *quite* like a
> > signal.
> 
> Well, I don't really know much about nfs so I can't really tell, but
> for most other cases, dealing with freezing like a signal should work
> fine from what I've seen although I can't be sure before actually
> trying.  Trond, Bruce, can you guys please chime in?

So, I think the question here would be: in nfs, how many of the
current freezer check points would be difficult to convert to the
signal handling model, after excluding the ones which are performed
while holding locks that we need to get rid of anyway?

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Mandeep Singh Baines
On Tue, Mar 5, 2013 at 5:16 PM, Tejun Heo  wrote:
> On Tue, Mar 05, 2013 at 08:05:07PM -0500, J. Bruce Fields wrote:
>> If it's really just a 2-line patch to try_to_freeze(), could it just be
>> carried out-of-tree by people that are specifically working on tracking
>> down these problems?
>>
>> But I don't have strong feelings about it--as long as it doesn't result
>> in the same known issues getting reported again and again
>
> Agreed, I don't think a Kconfig option is justified for this.  If this
> is really important, annotate broken paths so that it doesn't trigger
> spuriously; otherwise, please just remove it.
>

Fair enough. Let's revert then. I'll rework to use a lockdep annotation.

Maybe, add a new lockdep API:

lockdep_set_held_during_freeze(lock);

Then when we do the check, ignore any locks that set this bit.
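As a concrete, entirely hypothetical sketch of what that annotation
could look like (the flag field and the check shown here do not exist
in lockdep today; they are invented to illustrate the proposed design):

```c
/* Hypothetical annotation: mark a lock as deliberately held across
 * a freeze point so the "still has locks held" warning skips it. */
#define LOCKDEP_FLAG_HELD_DURING_FREEZE	(1 << 0)

static inline void lockdep_set_held_during_freeze(struct lockdep_map *map)
{
	map->flags |= LOCKDEP_FLAG_HELD_DURING_FREEZE;	/* hypothetical field */
}

/* And in the freezer-side held-locks check, annotated locks would be
 * ignored, along the lines of:
 *
 *	if (hlock->instance->flags & LOCKDEP_FLAG_HELD_DURING_FREEZE)
 *		continue;
 */
```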

Ingo, does this seem like a reasonable design to you?

Regards,
Mandeep

> Thanks.
>
> --
> tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Tejun Heo
On Tue, Mar 05, 2013 at 05:14:23PM -0800, Tejun Heo wrote:
> Then, the operation simply isn't freezable while in progress and
> should be on the receiving end of a failed-to-freeze error message,
> and users who depend on suspend/hibernation working properly should
> be advised away from using nfs.
> 
> It doesn't really change anything.  The current code is buggy.  We're
> just not enforcing the rules strictly and people aren't hitting the
> bug often enough.

Just one more thing.  Also, not being able to do retries without
side-effect doesn't mean it can't do retries at all.  There are
syscalls which need to do things differently on re-entry (forgot which
one but there are several).  They record the necessary state in the
restart block and resume from where they left off on re-entry.  It
sure is hairy but is doable and we already have supporting
infrastructure.  Not sure whether that would be worthwhile tho.

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Tejun Heo
On Tue, Mar 05, 2013 at 08:05:07PM -0500, J. Bruce Fields wrote:
> If it's really just a 2-line patch to try_to_freeze(), could it just be
> carried out-of-tree by people that are specifically working on tracking
> down these problems?
> 
> But I don't have strong feelings about it--as long as it doesn't result
> in the same known issues getting reported again and again

Agreed, I don't think a Kconfig option is justified for this.  If this
is really important, annotate broken paths so that it doesn't trigger
spuriously; otherwise, please just remove it.

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Tejun Heo
Hello, Trond.

On Wed, Mar 06, 2013 at 01:10:07AM +, Myklebust, Trond wrote:
> Not all RPC calls can just be interrupted and restarted. Something like
> an exclusive file create, unlink, file locking attempt, etc may give
> rise to different results when replayed in the above scenario.
> Interrupting an RPC call is not equivalent to cancelling its effects...

Then, the operation simply isn't freezable while in progress and
should be on the receiving end of a failed-to-freeze error message,
and users who depend on suspend/hibernation working properly should
be advised away from using nfs.

It doesn't really change anything.  The current code is buggy.  We're
just not enforcing the rules strictly and people aren't hitting the
bug often enough.

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Myklebust, Trond
On Tue, 2013-03-05 at 14:03 -0500, Jeff Layton wrote:
> On Tue, 5 Mar 2013 09:49:54 -0800
> Tejun Heo  wrote:
> 
> > On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
> > > So, I think this is why implementing freezer as a separate blocking
> > > mechanism isn't such a good idea.  We're effectively introducing a
> > > completely new waiting state to a lot of unsuspecting paths which
> > > generates a lot of risks and eventually extra complexity to work
> > > around those.  I think we really should update freezer to re-use the
> > > blocking points we already have - the ones used for signal delivery
> > > and ptracing.  That way, other code paths don't have to worry about an
> > > extra stop state and we can confine most complexities to freezer
> > > proper.
> > 
> > Also, consolidating those wait states means that we can solve the
> > event-to-response latency problem for all three cases - signal, ptrace
> > and freezer, rather than adding separate backing-out strategy for
> > freezer.
> > 
> 
> Sounds intriguing...
> 
> I'm not sure what this really means for something like NFS though. How
> would you envision this working when we have long running syscalls that
> might sit waiting in the kernel indefinitely?
> 
> Here's my blue-sky, poorly-thought-out idea...
> 
> We could add a signal (e.g. SIGFREEZE) that allows the sleeps in
> NFS/RPC layer to be interrupted. Those would return back toward
> userland with a particular type of error (sort of like ERESTARTSYS).
> 
> Before returning from the kernel though, we could freeze the process.
> When it wakes up, then we could go back down and retry the call again
> (much like an ERESTARTSYS kind of thing).

Two (three?) show stopper words for you: "non-idempotent operations".

Not all RPC calls can just be interrupted and restarted. Something like
an exclusive file create, unlink, file locking attempt, etc may give
rise to different results when replayed in the above scenario.
Interrupting an RPC call is not equivalent to cancelling its effects...

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread J. Bruce Fields
On Tue, Mar 05, 2013 at 04:59:00PM -0800, Mandeep Singh Baines wrote:
> On Tue, Mar 5, 2013 at 3:11 PM, J. Bruce Fields  wrote:
> > On Tue, Mar 05, 2013 at 09:49:54AM -0800, Tejun Heo wrote:
> >> On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
> >> > So, I think this is why implementing freezer as a separate blocking
> >> > mechanism isn't such a good idea.  We're effectively introducing a
> >> > completely new waiting state to a lot of unsuspecting paths which
> >> > generates a lot of risks and eventually extra complexity to work
> >> > around those.  I think we really should update freezer to re-use the
> >> > blocking points we already have - the ones used for signal delivery
> >> > and ptracing.  That way, other code paths don't have to worry about an
> >> > extra stop state and we can confine most complexities to freezer
> >> > proper.
> >>
> >> Also, consolidating those wait states means that we can solve the
> >> event-to-response latency problem for all three cases - signal, ptrace
> >> and freezer, rather than adding separate backing-out strategy for
> >> freezer.
> >
> > Meanwhile, as none of this sounds likely to be done this time
> > around--are we backing out the new lockdep warnings?
> >
> > --b.
> 
> What if we hide it behind a Kconfig? It's finding real bugs.
> 
> http://lkml.org/lkml/2013/3/5/583

If it's really just a 2-line patch to try_to_freeze(), could it just be
carried out-of-tree by people that are specifically working on tracking
down these problems?

But I don't have strong feelings about it--as long as it doesn't result
in the same known issues getting reported again and again

--b.


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Mandeep Singh Baines
On Tue, Mar 5, 2013 at 3:11 PM, J. Bruce Fields  wrote:
> On Tue, Mar 05, 2013 at 09:49:54AM -0800, Tejun Heo wrote:
>> On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
>> > So, I think this is why implementing freezer as a separate blocking
>> > mechanism isn't such a good idea.  We're effectively introducing a
>> > completely new waiting state to a lot of unsuspecting paths which
>> > generates a lot of risks and eventually extra complexity to work
>> > around those.  I think we really should update freezer to re-use the
>> > blocking points we already have - the ones used for signal delivery
>> > and ptracing.  That way, other code paths don't have to worry about an
>> > extra stop state and we can confine most complexities to freezer
>> > proper.
>>
>> Also, consolidating those wait states means that we can solve the
>> event-to-response latency problem for all three cases - signal, ptrace
>> and freezer, rather than adding separate backing-out strategy for
>> freezer.
>
> Meanwhile, as none of this sounds likely to be done this time
> around--are we backing out the new lockdep warnings?
>
> --b.

What if we hide it behind a Kconfig? It's finding real bugs.

http://lkml.org/lkml/2013/3/5/583


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Rafael J. Wysocki
On Tuesday, March 05, 2013 06:11:10 PM J. Bruce Fields wrote:
> On Tue, Mar 05, 2013 at 09:49:54AM -0800, Tejun Heo wrote:
> > On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
> > > So, I think this is why implementing freezer as a separate blocking
> > > mechanism isn't such a good idea.  We're effectively introducing a
> > > completely new waiting state to a lot of unsuspecting paths which
> > > generates a lot of risks and eventually extra complexity to work
> > > around those.  I think we really should update freezer to re-use the
> > > blocking points we already have - the ones used for signal delivery
> > > and ptracing.  That way, other code paths don't have to worry about an
> > > extra stop state and we can confine most complexities to freezer
> > > proper.
> > 
> > Also, consolidating those wait states means that we can solve the
> > event-to-response latency problem for all three cases - signal, ptrace
> > and freezer, rather than adding separate backing-out strategy for
> > freezer.
> 
> Meanwhile, as none of this sounds likely to be done this time
> around--are we backing out the new lockdep warnings?

I think we should do that.

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Tejun Heo
Hello, Jeff.

On Tue, Mar 05, 2013 at 06:39:41PM -0500, Jeff Layton wrote:
> Al was in the middle of his signal handling/execve rework though and I
> ran the idea past him. He pointedly told me that I was crazy for even
> considering it. This is rather non-trivial to handle since it means
> mucking around in a bunch of arch-specific code and dealing with all of
> the weirdo corner cases.

Hmmm... I tried it a couple years ago while I was working on ptrace
improvements to support checkpoint-restart in userland (PTRACE_SEIZE
and friends) and had a half-working code.  At the time, Oleg was
against the idea and something else came up so I didn't push it all
the way but IIRC it didn't need to touch any arch code.  It should be
able to just share the existing trap path with ptrace.

> Anyone up for working out how to handle a freeze event on a process
> that already has a pending signal, while it's being ptraced? It's
> probably best to chat with Al (cc'ed here) before you embark on this
> plan since he was just in that code recently.

Maybe Al sees something that I don't but AFAICS it should be doable in
generic code proper and I don't think it's gonna be that hard.

Oleg, are you still opposed to the idea of making freezer share trap
points with ptrace?

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Jeff Layton
On Tue, 5 Mar 2013 11:09:23 -0800
Tejun Heo  wrote:

> Hello, Jeff.
> 
> On Tue, Mar 05, 2013 at 02:03:12PM -0500, Jeff Layton wrote:
> > Sounds intriguing...
> > 
> > I'm not sure what this really means for something like NFS though. How
> > would you envision this working when we have long running syscalls that
> > might sit waiting in the kernel indefinitely?
> 
> I think it is the same problem as being able to handle SIGKILL in
> responsive manner.  It could be tricky to implement for nfs but it at
> least doesn't have to solve the problem twice.
> 
> > Here's my blue-sky, poorly-thought-out idea...
> > 
> > We could add a signal (e.g. SIGFREEZE) that allows the sleeps in
> > NFS/RPC layer to be interrupted. Those would return back toward
> > userland with a particular type of error (sort of like ERESTARTSYS).
> > 
> > Before returning from the kernel though, we could freeze the process.
> > When it wakes up, then we could go back down and retry the call again
> > (much like an ERESTARTSYS kind of thing).
> > 
> > The tricky part here is that we'd need to distinguish between the case
> > where we caught SIGFREEZE before sending an RPC vs. after. If we sent
> > the call before freezing, then we don't want to resend it again. It
> > might be a non-idempotent operation.
> 
> So, yeah, you are thinking pretty much the same as I am.
> 
> > Sounds horrific to code up though... :)
> 
> I don't know the details of nfs but those events could essentially be
> signaling that the system is gonna lose power.  I think it would be a
> good idea to solve it.
> 
> Thanks.
> 

It would be...

So, I briefly considered a similar approach when I was working on the
retry-on-ESTALE error patches. It occurred to me that it was somewhat
similar to the ERESTARTSYS case, so handling it at a higher level than
in the syscall handlers themselves might make sense...

Al was in the middle of his signal handling/execve rework though and I
ran the idea past him. He pointedly told me that I was crazy for even
considering it. This is rather non-trivial to handle since it means
mucking around in a bunch of arch-specific code and dealing with all of
the weirdo corner cases.

Anyone up for working out how to handle a freeze event on a process
that already has a pending signal, while it's being ptraced? It's
probably best to chat with Al (cc'ed here) before you embark on this
plan since he was just in that code recently.

In any case, maybe there's also some code consolidation opportunity
here too. I suspect at least some of this logic is in arch-specific
code when it really needn't be

-- 
Jeff Layton 


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread J. Bruce Fields
On Tue, Mar 05, 2013 at 09:49:54AM -0800, Tejun Heo wrote:
> On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
> > So, I think this is why implementing freezer as a separate blocking
> > mechanism isn't such a good idea.  We're effectively introducing a
> > completely new waiting state to a lot of unsuspecting paths which
> > generates a lot of risks and eventually extra complexity to work
> > around those.  I think we really should update freezer to re-use the
> > blocking points we already have - the ones used for signal delivery
> > and ptracing.  That way, other code paths don't have to worry about an
> > extra stop state and we can confine most complexities to freezer
> > proper.
> 
> Also, consolidating those wait states means that we can solve the
> event-to-response latency problem for all three cases - signal, ptrace
> and freezer, rather than adding separate backing-out strategy for
> freezer.

Meanwhile, as none of this sounds likely to be done this time
around--are we backing out the new lockdep warnings?

--b.


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Tejun Heo
Hello, Jeff.

On Tue, Mar 05, 2013 at 02:03:12PM -0500, Jeff Layton wrote:
> Sounds intriguing...
> 
> I'm not sure what this really means for something like NFS though. How
> would you envision this working when we have long running syscalls that
> might sit waiting in the kernel indefinitely?

I think it is the same problem as being able to handle SIGKILL in
responsive manner.  It could be tricky to implement for nfs but it at
least doesn't have to solve the problem twice.

> Here's my blue-sky, poorly-thought-out idea...
> 
> We could add a signal (e.g. SIGFREEZE) that allows the sleeps in
> NFS/RPC layer to be interrupted. Those would return back toward
> userland with a particular type of error (sort of like ERESTARTSYS).
> 
> Before returning from the kernel though, we could freeze the process.
> When it wakes up, then we could go back down and retry the call again
> (much like an ERESTARTSYS kind of thing).
> 
> The tricky part here is that we'd need to distinguish between the case
> where we caught SIGFREEZE before sending an RPC vs. after. If we sent
> the call before freezing, then we don't want to resend it again. It
> might be a non-idempotent operation.

So, yeah, you are thinking pretty much the same as I am.

> Sounds horrific to code up though... :)

I don't know the details of nfs but those events could essentially be
signaling that the system is gonna lose power.  I think it would be a
good idea to solve it.

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Jeff Layton
On Tue, 5 Mar 2013 09:49:54 -0800
Tejun Heo  wrote:

> On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
> > So, I think this is why implementing freezer as a separate blocking
> > mechanism isn't such a good idea.  We're effectively introducing a
> > completely new waiting state to a lot of unsuspecting paths which
> > generates a lot of risks and eventually extra complexity to work
> > around those.  I think we really should update freezer to re-use the
> > blocking points we already have - the ones used for signal delivery
> > and ptracing.  That way, other code paths don't have to worry about an
> > extra stop state and we can confine most complexities to freezer
> > proper.
> 
> Also, consolidating those wait states means that we can solve the
> event-to-response latency problem for all three cases - signal, ptrace
> and freezer, rather than adding separate backing-out strategy for
> freezer.
> 

Sounds intriguing...

I'm not sure what this really means for something like NFS though. How
would you envision this working when we have long running syscalls that
might sit waiting in the kernel indefinitely?

Here's my blue-sky, poorly-thought-out idea...

We could add a signal (e.g. SIGFREEZE) that allows the sleeps in
NFS/RPC layer to be interrupted. Those would return back toward
userland with a particular type of error (sort of like ERESTARTSYS).

Before returning from the kernel though, we could freeze the process.
When it wakes up, then we could go back down and retry the call again
(much like an ERESTARTSYS kind of thing).

The tricky part here is that we'd need to distinguish between the case
where we caught SIGFREEZE before sending an RPC vs. after. If we sent
the call before freezing, then we don't want to resend it again. It
might be a non-idempotent operation.

Sounds horrific to code up though... :)

-- 
Jeff Layton 


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Tejun Heo
On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
> So, I think this is why implementing freezer as a separate blocking
> mechanism isn't such a good idea.  We're effectively introducing a
> completely new waiting state to a lot of unsuspecting paths which
> generates a lot of risks and eventually extra complexity to work
> around those.  I think we really should update freezer to re-use the
> blocking points we already have - the ones used for signal delivery
> and ptracing.  That way, other code paths don't have to worry about an
> extra stop state and we can confine most complexities to freezer
> proper.

Also, consolidating those wait states means that we can solve the
event-to-response latency problem for all three cases - signal, ptrace
and freezer, rather than adding separate backing-out strategy for
freezer.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Tejun Heo
Hello, guys.

On Tue, Mar 05, 2013 at 08:23:08AM -0500, Jeff Layton wrote:
> So, not a deadlock per-se in this case but it does prevent the freezer
> from running to completion. I don't see any way to solve it though w/o
> making all mutexes freezable. Note that I don't think this is really
> limited to NFS either -- a lot of other filesystems will have similar
> problems: CIFS, some FUSE variants, etc...

So, I think this is why implementing freezer as a separate blocking
mechanism isn't such a good idea.  We're effectively introducing a
completely new waiting state to a lot of unsuspecting paths which
generates a lot of risks and eventually extra complexity to work
around those.  I think we really should update freezer to re-use the
blocking points we already have - the ones used for signal delivery
and ptracing.  That way, other code paths don't have to worry about an
extra stop state and we can confine most complexities to freezer
proper.

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Jeff Layton
On Mon, 4 Mar 2013 22:08:34 +
"Myklebust, Trond"  wrote:

> On Mon, 2013-03-04 at 21:53 +0100, Oleg Nesterov wrote:
> > On 03/04, Mandeep Singh Baines wrote:
> > >
> > > The problem is that freezer_count() calls try_to_freeze(). In this
> > > case, try_to_freeze() is not really adding any value.
> > 
> > Well, I tend to agree.
> > 
> > If a task calls __refrigerator() holding a lock which another freezable
> > task can wait for, this is not freezer-friendly.
> > 
> > freezable_schedule/freezer_do_not_count/etc not only means "I won't be
> > active if freezing()", it should also mean "I won't block suspend/etc".
> 
> If suspend for some reason requires a re-entrant mount, then yes, I can
> see how it might be a problem that we're holding a mount-related lock.
> The question is why is that necessary?
> 
> > OTOH, I understand that probably it is not trivial to change this code
> > to make it freezer-friendly. But at least  I disagree with "push your
> > problems onto others".
> 
> That code can't be made freezer-friendly if it isn't allowed to hold
> basic filesystem-related locks across RPC calls. A number of those RPC
> calls do things that need to be protected by VFS or MM-level locks.
> i.e.: lookups, file creation/deletion, page fault in/out, ...
>  
> IOW: the problem would need to be solved differently, possibly by adding
> a new FIFREEZE-like call to allow the filesystem to quiesce itself
> _before_ NetworkManager pulls the rug out from underneath it. There
> would still be plenty of corner cases to keep people entertained (e.g.
> the server goes down before the quiesce call is invoked) but at least
> the top 90% of cases would be solved.
> 

Ok, I think I'm starting to get it. It doesn't necessarily need a
reentrant mount or anything like that. Consider this case (which is not
even that unlikely):

Suppose there are two tasks calling unlink() on files in the same NFS
directory.

First task takes the i_mutex on the parent directory and goes to ask
the server to remove the file. Second task calls unlink just
afterward and blocks on the parent's i_mutex.

Now, a suspend event comes in and freezes the first task while it's
waiting on the response. It still holds the parent's i_mutex. Freezer
now gets to the second task and can't freeze it because the sleep on
that mutex isn't freezable.

So, not a deadlock per-se in this case but it does prevent the freezer
from running to completion. I don't see any way to solve it though w/o
making all mutexes freezable. Note that I don't think this is really
limited to NFS either -- a lot of other filesystems will have similar
problems: CIFS, some FUSE variants, etc...

-- 
Jeff Layton 


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Tejun Heo
Hello, guys.

On Tue, Mar 05, 2013 at 08:23:08AM -0500, Jeff Layton wrote:
> So, not a deadlock per-se in this case but it does prevent the freezer
> from running to completion. I don't see any way to solve it though w/o
> making all mutexes freezable. Note that I don't think this is really
> limited to NFS either -- a lot of other filesystems will have similar
> problems: CIFS, some FUSE variants, etc...

So, I think this is why implementing freezer as a separate blocking
mechanism isn't such a good idea.  We're effectively introducing a
completely new waiting state to a lot of unsuspecting paths which
generates a lot of risks and eventually extra complexity to work
around those.  I think we really should update freezer to re-use the
blocking points we already have - the ones used for signal delivery
and ptracing.  That way, other code paths don't have to worry about an
extra stop state and we can confine most complexities to freezer
proper.

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Tejun Heo
On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
> So, I think this is why implementing freezer as a separate blocking
> mechanism isn't such a good idea.  We're effectively introducing a
> completely new waiting state to a lot of unsuspecting paths which
> generates a lot of risks and eventually extra complexity to work
> around those.  I think we really should update freezer to re-use the
> blocking points we already have - the ones used for signal delivery
> and ptracing.  That way, other code paths don't have to worry about an
> extra stop state and we can confine most complexities to freezer
> proper.

Also, consolidating those wait states means that we can solve the
event-to-response latency problem for all three cases - signal, ptrace
and freezer, rather than adding separate backing-out strategy for
freezer.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Jeff Layton
On Tue, 5 Mar 2013 09:49:54 -0800
Tejun Heo t...@kernel.org wrote:

> On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
> > So, I think this is why implementing freezer as a separate blocking
> > mechanism isn't such a good idea.  We're effectively introducing a
> > completely new waiting state to a lot of unsuspecting paths which
> > generates a lot of risks and eventually extra complexity to work
> > around those.  I think we really should update freezer to re-use the
> > blocking points we already have - the ones used for signal delivery
> > and ptracing.  That way, other code paths don't have to worry about an
> > extra stop state and we can confine most complexities to freezer
> > proper.
>
> Also, consolidating those wait states means that we can solve the
> event-to-response latency problem for all three cases - signal, ptrace
> and freezer, rather than adding separate backing-out strategy for
> freezer.
>

Sounds intriguing...

I'm not sure what this really means for something like NFS though. How
would you envision this working when we have long running syscalls that
might sit waiting in the kernel indefinitely?

Here's my blue-sky, poorly-thought-out idea...

We could add a signal (e.g. SIGFREEZE) that allows the sleeps in
NFS/RPC layer to be interrupted. Those would return back toward
userland with a particular type of error (sort of like ERESTARTSYS).

Before returning from the kernel though, we could freeze the process.
When it wakes up, then we could go back down and retry the call again
(much like an ERESTARTSYS kind of thing).

The tricky part here is that we'd need to distinguish between the case
where we caught SIGFREEZE before sending an RPC vs. after. If we sent
the call before freezing, then we don't want to resend it again. It
might be a non-idempotent operation.

Sounds horrific to code up though... :)
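
The retry idea above can be sketched in kernel-flavored pseudocode.
Everything here is hypothetical -- SIGFREEZE, EFREEZEPENDING,
ERESTART_FREEZE, and the rpc_sent flag are names invented for
illustration, not existing kernel API:

```c
/* Pseudocode sketch only -- none of these names exist in the kernel. */
long nfs_do_operation(struct nfs_call_state *state)
{
	long status;

	if (!state->rpc_sent) {
		/* Nothing on the wire yet: safe to back out and retry. */
		status = rpc_call_interruptible(state);
		if (status == -EFREEZEPENDING)
			return -ERESTART_FREEZE;  /* freeze, then re-enter */
		state->rpc_sent = true;
	}
	/*
	 * Past this point the server may have seen the call, so a resend
	 * could replay a non-idempotent operation: wait for the reply.
	 */
	return rpc_wait_for_reply(state);
}
```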

-- 
Jeff Layton jlay...@redhat.com


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Tejun Heo
Hello, Jeff.

On Tue, Mar 05, 2013 at 02:03:12PM -0500, Jeff Layton wrote:
> Sounds intriguing...
>
> I'm not sure what this really means for something like NFS though. How
> would you envision this working when we have long running syscalls that
> might sit waiting in the kernel indefinitely?

I think it is the same problem as being able to handle SIGKILL in
responsive manner.  It could be tricky to implement for nfs but it at
least doesn't have to solve the problem twice.

> Here's my blue-sky, poorly-thought-out idea...
>
> We could add a signal (e.g. SIGFREEZE) that allows the sleeps in
> NFS/RPC layer to be interrupted. Those would return back toward
> userland with a particular type of error (sort of like ERESTARTSYS).
>
> Before returning from the kernel though, we could freeze the process.
> When it wakes up, then we could go back down and retry the call again
> (much like an ERESTARTSYS kind of thing).
>
> The tricky part here is that we'd need to distinguish between the case
> where we caught SIGFREEZE before sending an RPC vs. after. If we sent
> the call before freezing, then we don't want to resend it again. It
> might be a non-idempotent operation.

So, yeah, you are thinking pretty much the same as I am.

 Sounds horrific to code up though... :)

I don't know the details of nfs but those events could essentially be
signaling that the system is gonna lose power.  I think it would be a
good idea to solve it.

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread J. Bruce Fields
On Tue, Mar 05, 2013 at 09:49:54AM -0800, Tejun Heo wrote:
> On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
> > So, I think this is why implementing freezer as a separate blocking
> > mechanism isn't such a good idea.  We're effectively introducing a
> > completely new waiting state to a lot of unsuspecting paths which
> > generates a lot of risks and eventually extra complexity to work
> > around those.  I think we really should update freezer to re-use the
> > blocking points we already have - the ones used for signal delivery
> > and ptracing.  That way, other code paths don't have to worry about an
> > extra stop state and we can confine most complexities to freezer
> > proper.
>
> Also, consolidating those wait states means that we can solve the
> event-to-response latency problem for all three cases - signal, ptrace
> and freezer, rather than adding separate backing-out strategy for
> freezer.

Meanwhile, as none of this sounds likely to be done this time
around--are we backing out the new lockdep warnings?

--b.


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Jeff Layton
On Tue, 5 Mar 2013 11:09:23 -0800
Tejun Heo t...@kernel.org wrote:

> Hello, Jeff.
>
> On Tue, Mar 05, 2013 at 02:03:12PM -0500, Jeff Layton wrote:
> > Sounds intriguing...
> >
> > I'm not sure what this really means for something like NFS though. How
> > would you envision this working when we have long running syscalls that
> > might sit waiting in the kernel indefinitely?
>
> I think it is the same problem as being able to handle SIGKILL in
> responsive manner.  It could be tricky to implement for nfs but it at
> least doesn't have to solve the problem twice.
>
> > Here's my blue-sky, poorly-thought-out idea...
> >
> > We could add a signal (e.g. SIGFREEZE) that allows the sleeps in
> > NFS/RPC layer to be interrupted. Those would return back toward
> > userland with a particular type of error (sort of like ERESTARTSYS).
> >
> > Before returning from the kernel though, we could freeze the process.
> > When it wakes up, then we could go back down and retry the call again
> > (much like an ERESTARTSYS kind of thing).
> >
> > The tricky part here is that we'd need to distinguish between the case
> > where we caught SIGFREEZE before sending an RPC vs. after. If we sent
> > the call before freezing, then we don't want to resend it again. It
> > might be a non-idempotent operation.
>
> So, yeah, you are thinking pretty much the same as I'm.
>
> > Sounds horrific to code up though... :)
>
> I don't know the details of nfs but those events could essentially be
> signaling that the system is gonna lose power.  I think it would be a
> good idea to solve it.
>
> Thanks.
>

It would be...

So, I briefly considered a similar approach when I was working on the
retry-on-ESTALE error patches. It occurred to me that it was somewhat
similar to the ERESTARTSYS case, so handling it at a higher level than
in the syscall handlers themselves might make sense...

Al was in the middle of his signal handling/execve rework though and I
ran the idea past him. He pointedly told me that I was crazy for even
considering it. This is rather non-trivial to handle since it means
mucking around in a bunch of arch-specific code and dealing with all of
the weirdo corner cases.

Anyone up for working out how to handle a freeze event on a process
that already has a pending signal, while it's being ptraced? It's
probably best to chat with Al (cc'ed here) before you embark on this
plan since he was just in that code recently.

In any case, maybe there's also some code consolidation opportunity
here too. I suspect at least some of this logic is in arch-specific
code when it really needn't be

-- 
Jeff Layton jlay...@redhat.com


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Tejun Heo
Hello, Jeff.

On Tue, Mar 05, 2013 at 06:39:41PM -0500, Jeff Layton wrote:
> Al was in the middle of his signal handling/execve rework though and I
> ran the idea past him. He pointedly told me that I was crazy for even
> considering it. This is rather non-trivial to handle since it means
> mucking around in a bunch of arch-specific code and dealing with all of
> the weirdo corner cases.

Hmmm... I tried it a couple years ago while I was working on ptrace
improvements to support checkpoint-restart in userland (PTRACE_SEIZE
and friends) and had a half-working code.  At the time, Oleg was
against the idea and something else came up so I didn't push it all
the way but IIRC it didn't need to touch any arch code.  It should be
able to just share the existing trap path with ptrace.

> Anyone up for working out how to handle a freeze event on a process
> that already has a pending signal, while it's being ptraced? It's
> probably best to chat with Al (cc'ed here) before you embark on this
> plan since he was just in that code recently.

Maybe Al sees something that I don't but AFAICS it should be doable in
generic code proper and I don't think it's gonna be that hard.

Oleg, are you still opposed to the idea of making freezer share trap
points with ptrace?

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Rafael J. Wysocki
On Tuesday, March 05, 2013 06:11:10 PM J. Bruce Fields wrote:
> On Tue, Mar 05, 2013 at 09:49:54AM -0800, Tejun Heo wrote:
> > On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
> > > So, I think this is why implementing freezer as a separate blocking
> > > mechanism isn't such a good idea.  We're effectively introducing a
> > > completely new waiting state to a lot of unsuspecting paths which
> > > generates a lot of risks and eventually extra complexity to work
> > > around those.  I think we really should update freezer to re-use the
> > > blocking points we already have - the ones used for signal delivery
> > > and ptracing.  That way, other code paths don't have to worry about an
> > > extra stop state and we can confine most complexities to freezer
> > > proper.
> >
> > Also, consolidating those wait states means that we can solve the
> > event-to-response latency problem for all three cases - signal, ptrace
> > and freezer, rather than adding separate backing-out strategy for
> > freezer.
>
> Meanwhile, as none of this sounds likely to be done this time
> around--are we backing out the new lockdep warnings?

I think we should do that.

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Mandeep Singh Baines
On Tue, Mar 5, 2013 at 3:11 PM, J. Bruce Fields bfie...@fieldses.org wrote:
> On Tue, Mar 05, 2013 at 09:49:54AM -0800, Tejun Heo wrote:
> > On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
> > > So, I think this is why implementing freezer as a separate blocking
> > > mechanism isn't such a good idea.  We're effectively introducing a
> > > completely new waiting state to a lot of unsuspecting paths which
> > > generates a lot of risks and eventually extra complexity to work
> > > around those.  I think we really should update freezer to re-use the
> > > blocking points we already have - the ones used for signal delivery
> > > and ptracing.  That way, other code paths don't have to worry about an
> > > extra stop state and we can confine most complexities to freezer
> > > proper.
> >
> > Also, consolidating those wait states means that we can solve the
> > event-to-response latency problem for all three cases - signal, ptrace
> > and freezer, rather than adding separate backing-out strategy for
> > freezer.
>
> Meanwhile, as none of this sounds likely to be done this time
> around--are we backing out the new lockdep warnings?
>
> --b.

What if we hide it behind a Kconfig? It's finding real bugs.

http://lkml.org/lkml/2013/3/5/583


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread J. Bruce Fields
On Tue, Mar 05, 2013 at 04:59:00PM -0800, Mandeep Singh Baines wrote:
> On Tue, Mar 5, 2013 at 3:11 PM, J. Bruce Fields bfie...@fieldses.org wrote:
> > On Tue, Mar 05, 2013 at 09:49:54AM -0800, Tejun Heo wrote:
> > > On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
> > > > So, I think this is why implementing freezer as a separate blocking
> > > > mechanism isn't such a good idea.  We're effectively introducing a
> > > > completely new waiting state to a lot of unsuspecting paths which
> > > > generates a lot of risks and eventually extra complexity to work
> > > > around those.  I think we really should update freezer to re-use the
> > > > blocking points we already have - the ones used for signal delivery
> > > > and ptracing.  That way, other code paths don't have to worry about an
> > > > extra stop state and we can confine most complexities to freezer
> > > > proper.
> > >
> > > Also, consolidating those wait states means that we can solve the
> > > event-to-response latency problem for all three cases - signal, ptrace
> > > and freezer, rather than adding separate backing-out strategy for
> > > freezer.
> >
> > Meanwhile, as none of this sounds likely to be done this time
> > around--are we backing out the new lockdep warnings?
> >
> > --b.
>
> What if we hide it behind a Kconfig? Its finding real bugs.
>
> http://lkml.org/lkml/2013/3/5/583

If it's really just a 2-line patch to try_to_freeze(), could it just be
carried out-of-tree by people that are specifically working on tracking
down these problems?
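
For reference, the check Bruce refers to (commit 6aa9707099c4, whose log is
quoted later in the thread) amounts to a short guard in try_to_freeze().
This is a sketch of its shape from memory, not the verbatim diff:

```c
/* include/linux/freezer.h -- approximate shape of the 3.9-rc1 check */
static inline bool try_to_freeze(void)
{
	if (!(current->flags & PF_NOFREEZE))
		debug_check_no_locks_held(current); /* lockdep: warn if locks held */
	might_sleep();
	if (likely(!freezing(current)))
		return false;
	return __refrigerator(false);
}
```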

But I don't have strong feelings about it--as long as it doesn't result
in the same known issues getting reported again and again

--b.


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Myklebust, Trond
On Tue, 2013-03-05 at 14:03 -0500, Jeff Layton wrote:
> On Tue, 5 Mar 2013 09:49:54 -0800
> Tejun Heo t...@kernel.org wrote:
>
> > On Tue, Mar 05, 2013 at 09:46:48AM -0800, Tejun Heo wrote:
> > > So, I think this is why implementing freezer as a separate blocking
> > > mechanism isn't such a good idea.  We're effectively introducing a
> > > completely new waiting state to a lot of unsuspecting paths which
> > > generates a lot of risks and eventually extra complexity to work
> > > around those.  I think we really should update freezer to re-use the
> > > blocking points we already have - the ones used for signal delivery
> > > and ptracing.  That way, other code paths don't have to worry about an
> > > extra stop state and we can confine most complexities to freezer
> > > proper.
> >
> > Also, consolidating those wait states means that we can solve the
> > event-to-response latency problem for all three cases - signal, ptrace
> > and freezer, rather than adding separate backing-out strategy for
> > freezer.
> >
>
> Sounds intriguing...
>
> I'm not sure what this really means for something like NFS though. How
> would you envision this working when we have long running syscalls that
> might sit waiting in the kernel indefinitely?
>
> Here's my blue-sky, poorly-thought-out idea...
>
> We could add a signal (e.g. SIGFREEZE) that allows the sleeps in
> NFS/RPC layer to be interrupted. Those would return back toward
> userland with a particular type of error (sort of like ERESTARTSYS).
>
> Before returning from the kernel though, we could freeze the process.
> When it wakes up, then we could go back down and retry the call again
> (much like an ERESTARTSYS kind of thing).

Two (three?) show stopper words for you: non-idempotent operations.

Not all RPC calls can just be interrupted and restarted. Something like
an exclusive file create, unlink, file locking attempt, etc may give
rise to different results when replayed in the above scenario.
Interrupting an RPC call is not equivalent to cancelling its effects...

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Tejun Heo
Hello, Trond.

On Wed, Mar 06, 2013 at 01:10:07AM +, Myklebust, Trond wrote:
> Not all RPC calls can just be interrupted and restarted. Something like
> an exclusive file create, unlink, file locking attempt, etc may give
> rise to different results when replayed in the above scenario.
> Interrupting an RPC call is not equivalent to cancelling its effects...

Then, the operation simply isn't freezable while in progress and
should be on the receiving end of failed-to-freeze error message and
users who depend on suspend/hibernation working properly should be
advised away from using nfs.

It doesn't really change anything.  The current code is buggy.  We're
just not enforcing the rules strictly and people aren't hitting the
bug often enough.

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Tejun Heo
On Tue, Mar 05, 2013 at 08:05:07PM -0500, J. Bruce Fields wrote:
> If it's really just a 2-line patch to try_to_freeze(), could it just be
> carried out-of-tree by people that are specifically working on tracking
> down these problems?
>
> But I don't have strong feelings about it--as long as it doesn't result
> in the same known issues getting reported again and again

Agreed, I don't think a Kconfig option is justified for this.  If this
is really important, annotate broken paths so that it doesn't trigger
spuriously; otherwise, please just remove it.

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Tejun Heo
On Tue, Mar 05, 2013 at 05:14:23PM -0800, Tejun Heo wrote:
> Then, the operation simply isn't freezable while in progress and
> should be on the receiving end of failed-to-freeze error message and
> users who depend on suspend/hibernation working properly should be
> advised away from using nfs.
>
> It doesn't really changes anything.  The current code is buggy.  We're
> just not enforcing the rules strictly and people aren't hitting the
> bug often enough.

Just one more thing.  Also, not being able to do retries without
side-effect doesn't mean it can't do retries at all.  There are
syscalls which need to do things differently on re-entry (forgot which
one but there are several).  They record the necessary state in the
restart block and resume from where they left off on re-entry.  It
sure is hairy but is doable and we already have supporting
infrastructure.  Not sure whether that would be worthwhile tho.

Thanks.

-- 
tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-05 Thread Mandeep Singh Baines
On Tue, Mar 5, 2013 at 5:16 PM, Tejun Heo t...@kernel.org wrote:
> On Tue, Mar 05, 2013 at 08:05:07PM -0500, J. Bruce Fields wrote:
> > If it's really just a 2-line patch to try_to_freeze(), could it just be
> > carried out-of-tree by people that are specifically working on tracking
> > down these problems?
> >
> > But I don't have strong feelings about it--as long as it doesn't result
> > in the same known issues getting reported again and again
>
> Agreed, I don't think a Kconfig option is justified for this.  If this
> is really important, annotate broken paths so that it doesn't trigger
> spuriously; otherwise, please just remove it.


Fair enough. Let's revert then. I'll rework to use a lockdep annotation.

Maybe, add a new lockdep API:

lockdep_set_held_during_freeze(lock);

Then when we do the check, ignore any locks that set this bit.

Ingo, does this seem like a reasonable design to you?
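
A rough sketch of how such an annotation might look.
lockdep_set_held_during_freeze(), the LOCKDEP_HELD_DURING_FREEZE flag, and
print_freeze_lock_warning() are all hypothetical names for the proposal
above; only lockdep_depth, held_locks, and hlock_class() are real lockdep
internals:

```c
/* Hypothetical annotation at the lock site: */
lockdep_set_held_during_freeze(&inode->i_mutex);

/* ...and the freeze-time check would then skip annotated locks, roughly: */
static void check_locks_held_at_freeze(struct task_struct *curr)
{
	int i;

	for (i = 0; i < curr->lockdep_depth; i++) {
		struct held_lock *hlock = curr->held_locks + i;

		if (hlock_class(hlock)->flags & LOCKDEP_HELD_DURING_FREEZE)
			continue;	/* annotated as freeze-safe: no warning */
		print_freeze_lock_warning(hlock);
	}
}
```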

Regards,
Mandeep

 Thanks.

 --
 tejun


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-04 Thread Myklebust, Trond
On Mon, 2013-03-04 at 21:53 +0100, Oleg Nesterov wrote:
> On 03/04, Mandeep Singh Baines wrote:
> >
> > The problem is that freezer_count() calls try_to_freeze(). In this
> > case, try_to_freeze() is not really adding any value.
> 
> Well, I tend to agree.
> 
> If a task calls __refrigerator() holding a lock which another freezable
> task can wait for, this is not freezer-friendly.
> 
> freezable_schedule/freezer_do_not_count/etc not only means "I won't be
> active if freezing()", it should also mean "I won't block suspend/etc".

If suspend for some reason requires a re-entrant mount, then yes, I can
see how it might be a problem that we're holding a mount-related lock.
The question is why is that necessary?

> OTOH, I understand that probably it is not trivial to change this code
> to make it freezer-friendly. But at least  I disagree with "push your
> problems onto others".

That code can't be made freezer-friendly if it isn't allowed to hold
basic filesystem-related locks across RPC calls. A number of those RPC
calls do things that need to be protected by VFS or MM-level locks.
i.e.: lookups, file creation/deletion, page fault in/out, ...
 
IOW: the problem would need to be solved differently, possibly by adding
a new FIFREEZE-like call to allow the filesystem to quiesce itself
_before_ NetworkManager pulls the rug out from underneath it. There
would still be plenty of corner cases to keep people entertained (e.g.
the server goes down before the quiesce call is invoked) but at least
the top 90% of cases would be solved.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-04 Thread Oleg Nesterov
On 03/04, Mandeep Singh Baines wrote:
>
> The problem is that freezer_count() calls try_to_freeze(). In this
> case, try_to_freeze() is not really adding any value.

Well, I tend to agree.

If a task calls __refrigerator() holding a lock which another freezable
task can wait for, this is not freezer-friendly.

freezable_schedule/freezer_do_not_count/etc not only means "I won't be
active if freezing()", it should also mean "I won't block suspend/etc".

OTOH, I understand that probably it is not trivial to change this code
to make it freezer-friendly. But at least  I disagree with "push your
problems onto others".

Oleg.



Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-04 Thread Mandeep Singh Baines
On Mon, Mar 4, 2013 at 12:09 PM, Mandeep Singh Baines  wrote:
> On Mon, Mar 4, 2013 at 7:53 AM, Myklebust, Trond
>  wrote:
>> On Mon, 2013-03-04 at 23:33 +0800, Ming Lei wrote:
>>> Hi,
>>>
>>> CC guys who introduced the lockdep change.
>>>
>>> On Mon, Mar 4, 2013 at 11:04 PM, Jeff Layton  wrote:
>>>
>>> >
>>> > I don't get it -- why is it bad to hold a lock across a freeze event?
>>>
>>> At least this may deadlock another mount.nfs during freezing, :-)
>>>
>>> See detailed explanation in the commit log:
>>>
>>> commit 6aa9707099c4b25700940eb3d016f16c4434360d
>>> Author: Mandeep Singh Baines 
>>> Date:   Wed Feb 27 17:03:18 2013 -0800
>>>
>>> lockdep: check that no locks held at freeze time
>>>
>>> We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
>>> deadlock if the lock is later acquired in the suspend or hibernate path
>>> (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
>>> cgroup_freezer if a lock is held inside a frozen cgroup that is later
>>> acquired by a process outside that group.
>>>
>>
>> This is bloody ridiculous... If you want to add functionality to
>> implement cgroup or per-process freezing, then do it through some other
>> api instead of trying to push your problems onto others by adding new
>> global locking rules.
>>
>> Filesystems are a shared resource that have _nothing_ to do with process
>> cgroups. They need to be suspended when the network goes down or other
>> resources that they depend on are suspended. At that point, there is no
>> "what if I launch a new mount command?" scenario.
>>
>
> Hi Trond,
>
> My intention was to introduce new rules. My change simply introduces a

D'oh.

s/was/was not/

Regards,
Mandeep

> check for a deadlock case that can already happen.
>
> I think a deadlock could happen under the following scenario:
>
> 1) An administrator wants to freeze a container. Perhaps to checkpoint
> it and it migrate it some place else.
> 2) An nfs mount was in progress so we hit this code path and freeze
> with a lock held.
> 3) Another container tries to nfs mount.
> 4) Deadlock.
>
> Regards,
> Mandeep
>
>> Trond
>> --
>> Trond Myklebust
>> Linux NFS client maintainer
>>
>> NetApp
>> trond.mykleb...@netapp.com
>> www.netapp.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-04 Thread Mandeep Singh Baines
On Mon, Mar 4, 2013 at 7:53 AM, Myklebust, Trond
 wrote:
> On Mon, 2013-03-04 at 23:33 +0800, Ming Lei wrote:
>> Hi,
>>
>> CC guys who introduced the lockdep change.
>>
>> On Mon, Mar 4, 2013 at 11:04 PM, Jeff Layton  wrote:
>>
>> >
>> > I don't get it -- why is it bad to hold a lock across a freeze event?
>>
>> At least this may deadlock another mount.nfs during freezing, :-)
>>
>> See detailed explanation in the commit log:
>>
>> commit 6aa9707099c4b25700940eb3d016f16c4434360d
>> Author: Mandeep Singh Baines 
>> Date:   Wed Feb 27 17:03:18 2013 -0800
>>
>> lockdep: check that no locks held at freeze time
>>
>> We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
>> deadlock if the lock is later acquired in the suspend or hibernate path
>> (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
>> cgroup_freezer if a lock is held inside a frozen cgroup that is later
>> acquired by a process outside that group.
>>
>
> This is bloody ridiculous... If you want to add functionality to
> implement cgroup or per-process freezing, then do it through some other
> api instead of trying to push your problems onto others by adding new
> global locking rules.
>
> Filesystems are a shared resource that have _nothing_ to do with process
> cgroups. They need to be suspended when the network goes down or other
> resources that they depend on are suspended. At that point, there is no
> "what if I launch a new mount command?" scenario.
>

Hi Trond,

My intention was to introduce new rules. My change simply introduces a
check for a deadlock case that can already happen.

I think a deadlock could happen under the following scenario:

1) An administrator wants to freeze a container. Perhaps to checkpoint
it and migrate it someplace else.
2) An nfs mount was in progress so we hit this code path and freeze
with a lock held.
3) Another container tries to nfs mount.
4) Deadlock.

Regards,
Mandeep

> Trond
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> trond.mykleb...@netapp.com
> www.netapp.com


Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-04 Thread Mandeep Singh Baines
+ rjw, akpm, tejun, mingo, oleg

On Mon, Mar 4, 2013 at 6:23 AM, Jeff Layton  wrote:
> On Mon, 4 Mar 2013 14:14:23 +
> "Myklebust, Trond"  wrote:
>
>> On Mon, 2013-03-04 at 21:57 +0800, Ming Lei wrote:
>> > Hi,
>> >
>> > The below warning can be triggered each time when mount.nfs is
>> > running on 3.9-rc1.
>> >
>> > Not sure if freezable_schedule() inside rpc_wait_bit_killable should
>> > be changed to schedule() since nfs_clid_init_mutex is held in the path.
>>
>> Cc:ing Jeff, who added freezable_schedule(), and applied it to
>> rpc_wait_bit_killable.
>>
>> So this is occurring when the kernel enters the freeze state?
>> Why does it occur only with nfs_clid_init_mutex, and not with all the
>> other mutexes that we hold across RPC calls? We hold inode->i_mutex
>> across RPC calls all the time when doing renames, unlinks, file
>> creation,...
>>
>
> cc'ing Mandeep as well since his patch caused this to start popping up...
>
> We've also gotten some similar lockdep pops in the nfsd code recently.
> I responded to an email about it here on Friday, and I'll re-post what
> I said there below:
>
> -[snip]--
> Ok, I see...
>
> rpc_wait_bit_killable() calls freezable_schedule(). That calls
> freezer_count() which calls try_to_freeze(). try_to_freeze does this
> lockdep check now as of commit 6aa9707099.
>
> The assumption seems to be that freezing a thread while holding any
> sort of lock is bad. The rationale in that patch seems a bit sketchy to
> me though. We can be fairly certain that we're not going to deadlock by
> holding these locks, but I guess there could be something I've missed.
>
> Mandeep, can you elaborate on whether there's really a deadlock
> scenario here? If not, then is there some way to annotate these locks
> so this lockdep pop goes away?

We were seeing deadlocks on suspend that were root caused to locks
held at freeze time. In this case, the locks were device locks. I
wanted a way to detect if a process tried to freeze with a device_lock
held.

Then I thought about the cgroup_freezer subsystem (which I recently
turned on in our system). If you were to freeze a cgroup and a process
were holding a lock, you could deadlock. Seems reasonable that this
code path could cause a deadlock that way.

The problem is that freezer_count() calls try_to_freeze(). In this
case, try_to_freeze() is not really adding any value.

If we didn't try_to_freeze() in this instance of freezer_count() you'd
avoid the potential deadlock and not block suspend. I think a lot of
call sites are like that. They want to avoid blocking suspend but
aren't really interested in trying to freeze at the moment they call
freezer_count().

Regards,
Mandeep

> -[snip]--
>
>
>> > [   41.387939] =
>> > [   41.392913] [ BUG: mount.nfs/643 still has locks held! ]
>> > [   41.398559] 3.9.0-rc1+ #1740 Not tainted
>> > [   41.402709] -
>> > [   41.407714] 1 lock held by mount.nfs/643:
>> > [   41.411956]  #0:  (nfs_clid_init_mutex){+.+...}, at: []
>> > nfs4_discover_server_trunking+0x60/0x1d4
>> > [   41.422363]
>> > [   41.422363] stack backtrace:
>> > [   41.427032] [] (unwind_backtrace+0x0/0xe0) from
>> > [] (rpc_wait_bit_killable+0x38/0xc8)
>> > [   41.437103] [] (rpc_wait_bit_killable+0x38/0xc8) from
>> > [] (__wait_on_bit+0x54/0x9c)
>> > [   41.446990] [] (__wait_on_bit+0x54/0x9c) from
>> > [] (out_of_line_wait_on_bit+0x78/0x84)
>> > [   41.457061] [] (out_of_line_wait_on_bit+0x78/0x84) from
>> > [] (__rpc_execute+0x170/0x348)
>> > [   41.467407] [] (__rpc_execute+0x170/0x348) from
>> > [] (rpc_run_task+0x9c/0xa4)
>> > [   41.476715] [] (rpc_run_task+0x9c/0xa4) from []
>> > (rpc_call_sync+0x70/0xb0)
>> > [   41.485778] [] (rpc_call_sync+0x70/0xb0) from
>> > [] (nfs4_proc_setclientid+0x1a0/0x1c8)
>> > [   41.495819] [] (nfs4_proc_setclientid+0x1a0/0x1c8) from
>> > [] (nfs40_discover_server_trunki
>> > ng+0xec/0x148)
>> > [   41.507507] []
>> > (nfs40_discover_server_trunking+0xec/0x148) from []
> >> > (nfs4_discover_server_trunking+0x94/0x1d4)
>> > [   41.519866] [] (nfs4_discover_server_trunking+0x94/0x1d4)
> >> > from [] (nfs4_init_client+0x150/0x1b0)
>> > [   41.531036] [] (nfs4_init_client+0x150/0x1b0) from
>> > [] (nfs_get_client+0x2cc/0x320)
>> > [   41.540863] [] (nfs_get_client+0x2cc/0x320) from
>> > [] (nfs4_set_client+0x80/0xb0)
>> > [   41.550476] [] (nfs4_set_client+0x80/0xb0) from
>> > [] (nfs4_create_server+0xb0/0x21c)
>> > [   41.560333] [] (nfs4_create_server+0xb0/0x21c) from
>> > [] (nfs4_remote_mount+0x28/0x54)
>> > [   41.570373] [] (nfs4_remote_mount+0x28/0x54) from
>> > [] (mount_fs+0x6c/0x160)
>> > [   41.579498] [] (mount_fs+0x6c/0x160) from []
>> > (vfs_kern_mount+0x4c/0xc0)
>> > [   41.588378] [] (vfs_kern_mount+0x4c/0xc0) from
>> > [] (nfs_do_root_mount+0x74/0x90)
>> > [   41.597961] [] (nfs_do_root_mount+0x74/0x90) from
>> > [] 

Re: LOCKDEP: 3.9-rc1: mount.nfs/4272 still has locks held!

2013-03-04 Thread Myklebust, Trond
On Mon, 2013-03-04 at 23:33 +0800, Ming Lei wrote:
> Hi,
> 
> CC guys who introduced the lockdep change.
> 
> On Mon, Mar 4, 2013 at 11:04 PM, Jeff Layton  wrote:
> 
> >
> > I don't get it -- why is it bad to hold a lock across a freeze event?
> 
> At least this may deadlock another mount.nfs during freezing, :-)
> 
> See detailed explanation in the commit log:
> 
> commit 6aa9707099c4b25700940eb3d016f16c4434360d
> Author: Mandeep Singh Baines 
> Date:   Wed Feb 27 17:03:18 2013 -0800
> 
> lockdep: check that no locks held at freeze time
> 
> We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
> deadlock if the lock is later acquired in the suspend or hibernate path
> (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
> cgroup_freezer if a lock is held inside a frozen cgroup that is later
> acquired by a process outside that group.
> 

This is bloody ridiculous... If you want to add functionality to
implement cgroup or per-process freezing, then do it through some other
api instead of trying to push your problems onto others by adding new
global locking rules.

Filesystems are a shared resource that have _nothing_ to do with process
cgroups. They need to be suspended when the network goes down or other
resources that they depend on are suspended. At that point, there is no
"what if I launch a new mount command?" scenario.

Trond
-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com

