Re: [PATCH] e2fsprogs: error checking in blkid/devname.c

2008-02-22 Thread Theodore Tso
On Thu, Feb 21, 2008 at 04:10:17PM -0600, Eric Sandeen wrote:
> This is for RH Bugzilla #433857:
> rpc.mountd segfaults due to uninitialized value in e2fsprogs devname.c
>
> https://bugzilla.redhat.com/show_bug.cgi?id=433857
>
> which did some very helpful analysis and provided a patch.
>
> This patch is based on that, but checks all the devicemapper calls,
> and does some goto error handling / unwrapping, in the same style as
> the device-mapper lib code itself.
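
A minimal sketch of the goto-style error unwinding the patch description
mentions. This is not the patch itself: the fake_task_* functions are
hypothetical stand-ins for device-mapper task calls (they only mimic the
"1 on success, 0 on failure" convention), so that the control flow can be
compiled and read in isolation.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a device-mapper task; NOT the libdevmapper API. */
struct fake_task {
    const char *name;
};

/* Tasks created but not yet destroyed; lets a caller confirm nothing leaked. */
static int live_tasks;

static struct fake_task *fake_task_create(void)
{
    struct fake_task *task = calloc(1, sizeof(*task));

    if (task)
        live_tasks++;
    return task;
}

static void fake_task_destroy(struct fake_task *task)
{
    free(task);
    live_tasks--;
}

/* Returns 1 on success, 0 on failure (an empty name is rejected). */
static int fake_task_set_name(struct fake_task *task, const char *name)
{
    if (!name || !*name)
        return 0;
    task->name = name;
    return 1;
}

/* Returns 1 on success, 0 on failure. Names starting with "gone" simulate
 * a device that was removed before the task could run. */
static int fake_task_run(const struct fake_task *task)
{
    return strncmp(task->name, "gone", 4) != 0;
}

/* Probe one device. Every call is checked, and each failure jumps to the
 * label that releases exactly what has been acquired so far.
 * Returns 1 if the device was probed, 0 otherwise. */
int probe_device(const char *name)
{
    struct fake_task *task;
    int ret = 0;

    task = fake_task_create();
    if (!task)
        goto out;
    if (!fake_task_set_name(task, name))
        goto out_destroy;
    if (!fake_task_run(task))
        goto out_destroy;

    ret = 1;    /* success falls through the same cleanup path */

out_destroy:
    fake_task_destroy(task);
out:
    return ret;
}

/* Number of tasks still allocated; 0 means every path cleaned up. */
int live_task_count(void)
{
    return live_tasks;
}
```

The point of the single exit path is that a failure at any step returns 0
to the caller without leaking the task and without touching results that
were never filled in.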

This looks good, but I assume that the bug was caused by some race
condition where if you try to call dm_task_get_info() while some other
process is creating or removing a snapshot, dm_task_get_info() is
returning some kind of EAGAIN, or some other "Try again; we're busy"
error, right?

If that is the case, can you try to find out what error is being
returned?  It may be that the right thing to do is to check to see if we
are getting a "resource is locked; try again in a sec" error message,
and retry the dm_task_get_info(), instead of just returning a failure.

Thanks!!

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] e2fsprogs: error checking in blkid/devname.c

2008-02-22 Thread Eric Sandeen
Theodore Tso wrote:
> This looks good, but I assume that the bug was caused by some race
> condition where if you try to call dm_task_get_info() while some other
> process is creating or removing a snapshot, dm_task_get_info() is
> returning some kind of EAGAIN, or some other "Try again; we're busy"
> error, right?
>
> If that is the case, can you try to find out what error is being
> returned?  It may be that the right thing to do is to check to see if we
> are getting a "resource is locked; try again in a sec" error message,
> and retry the dm_task_get_info(), instead of just returning a failure.

well, dm_task_get_info just returns either 0 or 1; unless there is some
other contextual piece of information to use, I don't know if we can
differentiate between error types.  I'll ask agk...

> Thanks!!
>
>   - Ted



Re: [PATCH] e2fsprogs: error checking in blkid/devname.c

2008-02-22 Thread Theodore Tso
On Fri, Feb 22, 2008 at 09:02:56AM -0600, Eric Sandeen wrote:
> Theodore Tso wrote:
>> This looks good, but I assume that the bug was caused by some race
>> condition where if you try to call dm_task_get_info() while some other
>> process is creating or removing a snapshot, dm_task_get_info() is
>> returning some kind of EAGAIN, or some other "Try again; we're busy"
>> error, right?
>>
>> If that is the case, can you try to find out what error is being
>> returned?  It may be that the right thing to do is to check to see if we
>> are getting a "resource is locked; try again in a sec" error message,
>> and retry the dm_task_get_info(), instead of just returning a failure.
>
> well, dm_task_get_info just returns either 0 or 1; unless there is some
> other contextual piece of information to use, I don't know if we can
> differentiate between error types.  I'll ask agk...

Maybe the right thing is to try 3 times before giving up, maybe with a
nanosleep in between, or some such?  Hopefully agk can give us some
hints about what's the right way to handle errors from all of the
dm_task* calls.
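
A rough sketch of the "try 3 times with a nanosleep in between" idea, for
illustration only. The helper name retry_with_delay() and the callback
succeed_after() are hypothetical; the only real API used is POSIX
nanosleep(). Later messages in this thread conclude that retrying is not
the right fix for this particular bug.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <time.h>

/* An attempt returns 1 on success and 0 on failure, matching the
 * convention discussed for the dm_task* calls. */
typedef int (*attempt_fn)(void *ctx);

/* Call fn(ctx) up to max_tries times, sleeping delay_ns nanoseconds
 * between attempts. Returns 1 as soon as one attempt succeeds, or 0 if
 * all of them fail. A signal may cut a sleep short; this sketch does not
 * resume the interrupted sleep. */
int retry_with_delay(attempt_fn fn, void *ctx, int max_tries, long delay_ns)
{
    struct timespec delay;
    int i;

    delay.tv_sec = delay_ns / 1000000000L;
    delay.tv_nsec = delay_ns % 1000000000L;

    for (i = 0; i < max_tries; i++) {
        if (fn(ctx))
            return 1;
        if (i + 1 < max_tries)      /* no point sleeping after the last try */
            nanosleep(&delay, NULL);
    }
    return 0;
}

/* Hypothetical operation for demonstration: fails until the counter
 * pointed to by ctx has been decremented to zero. */
int succeed_after(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return 0;
    }
    return 1;
}
```

With a bare 0/1 return and no error code, a loop like this cannot tell a
transient failure from a permanent one, so it would also retry (and delay)
in cases such as a device that has been removed for good.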

Thanks!!

- Ted


Re: [PATCH] e2fsprogs: error checking in blkid/devname.c

2008-02-22 Thread Eric Sandeen
Theodore Tso wrote:
> On Fri, Feb 22, 2008 at 09:02:56AM -0600, Eric Sandeen wrote:
>> Theodore Tso wrote:
>>> This looks good, but I assume that the bug was caused by some race
>>> condition where if you try to call dm_task_get_info() while some other
>>> process is creating or removing a snapshot, dm_task_get_info() is
>>> returning some kind of EAGAIN, or some other "Try again; we're busy"
>>> error, right?
>>>
>>> If that is the case, can you try to find out what error is being
>>> returned?  It may be that the right thing to do is to check to see if we
>>> are getting a "resource is locked; try again in a sec" error message,
>>> and retry the dm_task_get_info(), instead of just returning a failure.
>>
>> well, dm_task_get_info just returns either 0 or 1; unless there is some
>> other contextual piece of information to use, I don't know if we can
>> differentiate between error types.  I'll ask agk...
>
> Maybe the right thing is to try 3 times before giving up, maybe with a
> nanosleep in between, or some such?  Hopefully agk can give us some
> hints about what's the right way to handle errors from all of the
> dm_task* calls.

From a quick chat with agk, it sounds like outright failure is
appropriate.  Sounds like most of the calls fail for reasons like ENOMEM
(but it might be nice if it returned that, eh?)

-Eric


Re: [PATCH] e2fsprogs: error checking in blkid/devname.c

2008-02-22 Thread Philip Spencer


> This looks good, but I assume that the bug was caused by some race
> condition where if you try to call dm_task_get_info() while some other
> process is creating or removing a snapshot, dm_task_get_info() is
> returning some kind of EAGAIN, or some other "Try again; we're busy"
> error, right?
>
> If that is the case, can you try to find out what error is being
> returned?  It may be that the right thing to do is to check to see if we
> are getting a "resource is locked; try again in a sec" error message,
> and retry the dm_task_get_info(), instead of just returning a failure.
>
> Thanks!!


[A copy of my posting to RH Bugzilla]

I (the original poster) know very little about either e2fsprogs or
device-mapper, and had originally just assumed it would be normal for the
info field to be null after a call to DM_DEVICE_DEPS if there were no
dependents. After a quick look at the sources, though, I see that the info
field "dmi" inside the task structure is just what is returned by the
ioctl, so it does appear to me now that some sort of error occurred, and
that otherwise it would have returned a non-null dmi with a zero "exists"
flag inside it.


Correct me if I'm wrong, but it seems that:

  -- No point in retrying dm_task_get_info(); it is just unpacking the
     dmi structure returned by the previous dm_task_run call, which is null.
     It is in dm_task_run that the error occurred.

  -- The code in dm_task_run seems to already take care of retrying EAGAIN
     conditions.

  -- One obvious other type of race condition would be if the device were
     removed in between the task creation and call to dm_task_run. In that
     case, Eric's patch seems to do exactly the right thing -- no point in
     continuing if the device is gone anyway.

  -- But, I don't think that's the race condition we're seeing. A gdb
     printout of the task structure shows

     {type = 7, dev_name = 0x2ace3e10 "vg1-snapweb-cow", head = 0x0,
      tail = 0x0, read_only = 0, event_nr = 0, major = -1, minor = -1, uid = 0,
      gid = 6, mode = 432, dmi = {v4 = 0x0, v1 = 0x0}, newname = 0x0,
      message = 0x0, geometry = 0x0, sector = 0, no_flush = 0, no_open_count = 0,
      skip_lockfs = 0, suppress_identical_reload = 0, uuid = 0x0}

This is associated with the snapshot volume "snapweb" which was being backed 
up at the time. Timestamps on the backup logs indicate that my backup 
script moved on to the next filesystem 30 seconds AFTER the segfault, so, 
unless something really slowed down the system so that deallocation of the 
snapweb volume took a full 30 seconds, it does not appear that the 
segfault occurred during the unmounting and deallocating of snapweb.
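
The first point in the list above (the info call only unpacks what the
run step left behind, so a failed run leaves nothing to unpack) can be
modelled with a small self-contained sketch. All types and functions here
are hypothetical stand-ins, not the libdevmapper implementation; the "dmi"
pointer merely mirrors the field name seen in the gdb printout.

```c
#include <stddef.h>

/* Hypothetical model of the buffer an ioctl would fill in. */
struct fake_ioctl_result {
    int exists;
    int major;
    int minor;
};

/* Hypothetical task: dmi stays NULL until a run succeeds. */
struct fake_task {
    int device_present;                 /* simulation knob, not a real field */
    struct fake_ioctl_result storage;
    struct fake_ioctl_result *dmi;
};

/* What the caller receives after unpacking. */
struct fake_info {
    int exists;
    int major;
    int minor;
};

/* Run step: the only place where the error actually happens.
 * Returns 1 on success, 0 on failure (dmi is left NULL). */
int fake_task_run(struct fake_task *task)
{
    if (!task->device_present) {
        task->dmi = NULL;
        return 0;
    }
    task->storage.exists = 1;
    task->storage.major = 253;          /* arbitrary illustrative numbers */
    task->storage.minor = 4;
    task->dmi = &task->storage;
    return 1;
}

/* Unpack step: copies fields out of dmi. With a NULL dmi it returns 0 and
 * leaves *info untouched, so a caller that ignores the return value would
 * go on to read whatever *info happened to contain. */
int fake_task_get_info(const struct fake_task *task, struct fake_info *info)
{
    if (!task->dmi)
        return 0;
    info->exists = task->dmi->exists;
    info->major = task->dmi->major;
    info->minor = task->dmi->minor;
    return 1;
}

/* Checked sequence: stop at the first failure and never read *info
 * unless both steps succeeded. */
int query_device(struct fake_task *task, struct fake_info *info)
{
    if (!fake_task_run(task))
        return 0;
    return fake_task_get_info(task, info);
}
```

In this model, calling the unpack step again after a failed run can only
return 0 again, which is why retrying it in isolation would not help.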


I also don't understand why major/minor are -1 in the above structure; is 
that normal?


- Philip

+---
Philip Spencer  [EMAIL PROTECTED] | Director of Computing Services
Room 336(416)-348-9710  ext3036 | The Fields Institute for
222 College St, Toronto ON M5T 3J1 Canada   | Research in Mathematical Sciences

On Fri, 22 Feb 2008, Theodore Tso wrote:


> On Thu, Feb 21, 2008 at 04:10:17PM -0600, Eric Sandeen wrote:
>
>> This is for RH Bugzilla #433857:
>> rpc.mountd segfaults due to uninitialized value in e2fsprogs devname.c
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=433857
>>
>> which did some very helpful analysis and provided a patch.
>>
>> This patch is based on that, but checks all the devicemapper calls,
>> and does some goto error handling / unwrapping, in the same style as
>> the device-mapper lib code itself.
>
> - Ted




Re: [PATCH] e2fsprogs: error checking in blkid/devname.c

2008-02-22 Thread Theodore Tso
On Fri, Feb 22, 2008 at 10:16:53AM -0600, Eric Sandeen wrote:
> From a quick chat with agk, it sounds like outright failure is
> appropriate.  Sounds like most of the calls fail for reasons like ENOMEM
> (but it might be nice if it returned that, eh?)

So the question then is why is it that Philip was seeing
failures when he was creating and deleting snapshots?

I don't mind having blkid return a failure, but it may not fix
Philip's scenario which he listed in BZ #433857; yeah, he won't have
a core dump, which is good, but it might mean that some or all of the
dm volumes disappear from the blkid results.

- Ted


Re: [PATCH] e2fsprogs: error checking in blkid/devname.c

2008-02-22 Thread Eric Sandeen
Theodore Tso wrote:
> On Fri, Feb 22, 2008 at 10:16:53AM -0600, Eric Sandeen wrote:
>> From a quick chat with agk, it sounds like outright failure is
>> appropriate.  Sounds like most of the calls fail for reasons like ENOMEM
>> (but it might be nice if it returned that, eh?)
>
> So the question then is why is it that Philip was seeing
> failures when he was creating and deleting snapshots?
>
> I don't mind having blkid return a failure, but it may not fix
> Philip's scenario which he listed in BZ #433857; yeah, he won't have
> a core dump, which is good, but it might mean that some or all of the
> dm volumes disappear from the blkid results.

Maybe a device-mapper bug is in order :)

-Eric


Re: [PATCH] e2fsprogs: error checking in blkid/devname.c

2008-02-22 Thread Philip Spencer

You know what -- I went back and double-checked all the logs, and somehow
or other I must have recorded a timestamp wrong as 3:19:21 instead of 
3:19:51.


The segfault did in fact happen at 3:19:51 a.m. which is exactly the same 
time as my backup script moved on to the next filesystem.


So, it occurred during the unmount and lvremove of the snapshot volume.
It is, then, entirely expected that the device-mapper routines would 
return an error if the device no longer existed when the task was run.


My apologies for mixing up the timestamps! And no bug in device-mapper, 
just the one in e2fsprogs which segfaulted in this circumstance instead of 
dropping the device from its list. Having it fail outright, and not list 
the device at all, is the correct behaviour for this situation -- just as 
if the device had already been removed before the blkid routines were run.
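
The behaviour described above (a device whose probe fails is simply left
out of the results, as if it had been removed earlier) might look roughly
like the sketch below. It is an illustration under stated assumptions, not
the blkid code: probe_ok() is a hypothetical stand-in that fails for any
name containing "snap", and collect_devices() is an invented helper.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical probe: returns 1 on success, 0 on failure. Names containing
 * "snap" simulate a snapshot that vanished while it was being probed. */
static int probe_ok(const char *name)
{
    return strstr(name, "snap") == NULL;
}

/* Copy into out[] only the names whose probe succeeded, preserving order.
 * out must have room for n entries. Returns the number of names kept. */
size_t collect_devices(const char *const *names, size_t n, const char **out)
{
    size_t i;
    size_t kept = 0;

    for (i = 0; i < n; i++) {
        if (!probe_ok(names[i]))
            continue;           /* drop this device, keep scanning the rest */
        out[kept++] = names[i];
    }
    return kept;
}
```

A failure on one device affects only that entry; the remaining devices are
still listed.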


- Philip

On Fri, 22 Feb 2008, Theodore Tso wrote:


> On Fri, Feb 22, 2008 at 10:16:53AM -0600, Eric Sandeen wrote:
>
>> From a quick chat with agk, it sounds like outright failure is
>> appropriate.  Sounds like most of the calls fail for reasons like ENOMEM
>> (but it might be nice if it returned that, eh?)
>
> So the question then is why is it that Philip was seeing
> failures when he was creating and deleting snapshots?
>
> I don't mind having blkid return a failure, but it may not fix
> Philip's scenario which he listed in BZ #433857; yeah, he won't have
> a core dump, which is good, but it might mean that some or all of the
> dm volumes disappear from the blkid results.
>
> - Ted



+---
Philip Spencer  [EMAIL PROTECTED] | Director of Computing Services
Room 336(416)-348-9710  ext3036 | The Fields Institute for
222 College St, Toronto ON M5T 3J1 Canada   | Research in Mathematical Sciences


Re: [PATCH] e2fsprogs: error checking in blkid/devname.c

2008-02-22 Thread Theodore Tso
On Fri, Feb 22, 2008 at 10:52:36AM -0600, Eric Sandeen wrote:
> Theodore Tso wrote:
>> On Fri, Feb 22, 2008 at 10:16:53AM -0600, Eric Sandeen wrote:
>>> From a quick chat with agk, it sounds like outright failure is
>>> appropriate.  Sounds like most of the calls fail for reasons like ENOMEM
>>> (but it might be nice if it returned that, eh?)
>>
>> So the question then is why is it that Philip was seeing
>> failures when he was creating and deleting snapshots?
>>
>> I don't mind having blkid return a failure, but it may not fix
>> Philip's scenario which he listed in BZ #433857; yeah, he won't have
>> a core dump, which is good, but it might mean that some or all of the
>> dm volumes disappear from the blkid results.
>
> Maybe a device-mapper bug is in order :)

Yep, especially if it can be easily reproduced.

- Ted


Re: [PATCH] e2fsprogs: error checking in blkid/devname.c

2008-02-22 Thread Theodore Tso
On Fri, Feb 22, 2008 at 01:10:40PM -0500, Philip Spencer wrote:
> You know what -- I went back and double-checked all the logs, and somehow
> or other I must have recorded a timestamp wrong as 3:19:21 instead of
> 3:19:51.
>
> The segfault did in fact happen at 3:19:51 a.m. which is exactly the same
> time as my backup script moved on to the next filesystem.
>
> So, it occurred during the unmount and lvremove of the snapshot volume.
> It is, then, entirely expected that the device-mapper routines would return
> an error if the device no longer existed when the task was run.
>
> My apologies for mixing up the timestamps! And no bug in device-mapper,
> just the one in e2fsprogs which segfaulted in this circumstance instead of
> dropping the device from its list. Having it fail outright, and not list
> the device at all, is the correct behaviour for this situation -- just as
> if the device had already been removed before the blkid routines were run.

OK, that's helpful to know.  Thanks!!!

- Ted