Re: [dm-devel] [PATCH] dm: noflush resizing (0/3)
On Thu, Oct 25, 2007 at 02:46:17PM -0400, Jun'ichi Nomura wrote: > So as far as I understand, the point is: > 1. it's preferable to resize inode after the resume, if possible Not quite - I'm not expressing a preference yet. I'm saying the patches you presented were one option to resolve the problem, but it was not obvious to me they are the only option and I wanted to see a deeper analysis (keeping in mind the general direction I want the device-mapper code to be heading in). In other words, an analysis that draws out the conflict at the heart of this problem, and whether the problem we have with this locking is intrinsic to the current design or just a matter of implementation that can be fixed with some re-ordering or tweaks to the usage rules or types of the locks. Alasdair -- [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [dm-devel] [PATCH] dm: noflush resizing (0/3)
Alasdair, Jun'ichi Nomura wrote: > So as far as I understand, the point is: > 1. it's preferable to resize inode after the resume, if possible Attached is a modified version of the (2/3) patch. - The logic is basically the same as before. - Moved the set-size function outside of the table swapping. - Moved the uevent to the end of resize from the end of resume. - Lightly tested on LVM2 mirror resizing. - (1/3) and (3/3) are unchanged. What do you think about this? --- drivers/md/dm-ioctl.c |4 ++ drivers/md/dm.c | 61 ++ include/linux/device-mapper.h |5 +++ 3 files changed, 46 insertions(+), 24 deletions(-) Index: linux-2.6.23.work/drivers/md/dm.c === --- linux-2.6.23.work.orig/drivers/md/dm.c +++ linux-2.6.23.work/drivers/md/dm.c @@ -1099,31 +1099,11 @@ static void event_callback(void *context wake_up(>eventq); } -static void __set_size(struct mapped_device *md, sector_t size) -{ - set_capacity(md->disk, size); - - mutex_lock(>suspended_bdev->bd_inode->i_mutex); - i_size_write(md->suspended_bdev->bd_inode, (loff_t)size << SECTOR_SHIFT); - mutex_unlock(>suspended_bdev->bd_inode->i_mutex); -} - static int __bind(struct mapped_device *md, struct dm_table *t) { struct request_queue *q = md->queue; - sector_t size; - - size = dm_table_get_size(t); - - /* -* Wipe any geometry if the size of the table changed. -*/ - if (size != get_capacity(md->disk)) - memset(>geometry, 0, sizeof(md->geometry)); - if (md->suspended_bdev) - __set_size(md, size); - if (size == 0) + if (!dm_table_get_size(t)) return 0; dm_table_get(t); @@ -1497,8 +1477,6 @@ int dm_resume(struct mapped_device *md) dm_table_unplug_all(map); - kobject_uevent(>disk->kobj, KOBJ_CHANGE); - r = 0; out: @@ -1508,6 +1486,43 @@ out: return r; } +void dm_set_size(struct mapped_device *md) +{ + struct dm_table *map; + struct block_device *bdev; + sector_t size; + + down(>suspend_lock); + + map = dm_get_table(md); + if (!map) + goto out; + + size = dm_table_get_size(map); + + /* +* Wipe any geometry if the size of the table changed. +*/ + if (size != get_capacity(md->disk)) + memset(>geometry, 0, sizeof(md->geometry)); + + set_capacity(md->disk, size); + + bdev = bdlookup_disk(md->disk, 0); + if (bdev) { + mutex_lock(>bd_mutex); + i_size_write(bdev->bd_inode, (loff_t)size << SECTOR_SHIFT); + mutex_unlock(>bd_mutex); + bdput(bdev); + } + + kobject_uevent(>disk->kobj, KOBJ_CHANGE); + +out: + dm_table_put(map); + up(>suspend_lock); +} + /*- * Event notification. *---*/ Index: linux-2.6.23.work/drivers/md/dm-ioctl.c === --- linux-2.6.23.work.orig/drivers/md/dm-ioctl.c +++ linux-2.6.23.work/drivers/md/dm-ioctl.c @@ -840,8 +840,10 @@ static int do_resume(struct dm_ioctl *pa if (dm_suspended(md)) r = dm_resume(md); - if (!r) + if (!r) { + dm_set_size(md); r = __dev_status(md, param); + } dm_put(md); return r; Index: linux-2.6.23.work/include/linux/device-mapper.h === --- linux-2.6.23.work.orig/include/linux/device-mapper.h +++ linux-2.6.23.work/include/linux/device-mapper.h @@ -198,6 +198,11 @@ int dm_noflush_suspending(struct dm_targ int dm_get_geometry(struct mapped_device *md, struct hd_geometry *geo); int dm_set_geometry(struct mapped_device *md, struct hd_geometry *geo); +/* + * Update device size + */ +void dm_set_size(struct mapped_device *md); + /*- * Functions for manipulating device-mapper tables. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [dm-devel] [PATCH] dm: noflush resizing (0/3)
Hi Alasdair, Thanks for the immediate reply. So as far as I understand, the point is: 1. it's preferable to resize inode after the resume, if possible 2. it's necessary to have nowait version of bdget() to avoid stall For the 1st item, I guess the reason is that we want to make the table swapping code deterministic. Is it correct? Even if we don't have to wait for bdget(), you like to have the resizing code after the resume? For the 2nd item, the patch included bdlookup() for that purpose. What do you think about it? I added below some additional comments and questions. Alasdair G Kergon wrote: > On Thu, Oct 25, 2007 at 10:18:02AM -0400, Jun'ichi Nomura wrote: >> There is no guarantee that the I/O flowing through the device again. >> The table might need be replaced again, but to do that, the resume >> should have been completed to let the userspace know it. > > Then the first attempt to set the size could be made to fail > (because it could not get the lock immediately) and the > size could be set after the second resume instead. > > - Setting the size would lag behind the actual size the dm table was > supporting, but (given the usage cases discussed) this would not matter. > >> bdget() in noflush suspend has a possibility of stall. > > So we cannot avoid fixing that: we require immediate return > with failure instead of waiting. bdlookup() is the almost nowait version of bdget() ("almost" means it may wait for I_NEW turned off in case there is bdget() doing its latter half. But it never stalls.) Do you think we need something else? >> OTOH, calling bdget() and i_size_write() outside of the lock >> can cause race with other table swapping and may result in setting >> wrong device size. > > If the size setting is removed from the lock, then it becomes > "set the inode size to match the current size of the table" and > races would not matter - each "set size" attempt would set it > to the instantaneous live table size, not a cached value that > could be out-of-date. Even though, I think we still want set_capacity() during the table swapping. So we need to call dm_table_get_size() twice. Is it ok? Also, do you want to move the resizing code for normal suspend, too? Otherwise, the code will be duplicated and I don't think you like it. Thanks, -- Jun'ichi Nomura, NEC Corporation of America - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [dm-devel] [PATCH] dm: noflush resizing (0/3)
On Thu, Oct 25, 2007 at 10:18:02AM -0400, Jun'ichi Nomura wrote: > There is no guarantee that the I/O flowing through the device again. > The table might need be replaced again, but to do that, the resume > should have been completed to let the userspace know it. Then the first attempt to set the size could be made to fail (because it could not get the lock immediately) and the size could be set after the second resume instead. - Setting the size would lag behind the actual size the dm table was supporting, but (given the usage cases discussed) this would not matter. > bdget() in noflush suspend has a possibility of stall. So we cannot avoid fixing that: we require immediate return with failure instead of waiting. > OTOH, calling bdget() and i_size_write() outside of the lock > can cause race with other table swapping and may result in setting > wrong device size. If the size setting is removed from the lock, then it becomes "set the inode size to match the current size of the table" and races would not matter - each "set size" attempt would set it to the instantaneous live table size, not a cached value that could be out-of-date. Alasdair -- [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [dm-devel] [PATCH] dm: noflush resizing (0/3)
Hi Alasdair, Alasdair G Kergon wrote: > Before reviewing the details of the proposed workaround, I'd like to see > a deeper analysis of the problem to see that there isn't a cleaner way > to resolve this. OK. Let me try. > For example: > > Question) What are the realistic situations we must support that lead to > a resize during table reload with I/O outstanding? 'noflush' is currently the only option for userspace to suspend without risking otherwise-avoidable I/O lost, because failure of underlying device might occur *after* the userspace started normal suspend but *before* the flushing completes. (normal suspend = flushing suspend) E.g. if userspace wants to resize dm-mpath device with queue_if_no_path, it has to use 'noflush' suspend on table swapping. Otherwise, path failures during the suspend will cause I/O error, though queue_if_no_path is specified. Other possible solution would be allowing the suspend to fail if it can't flush all I/O and letting userspace to retry after fixing the device failure. > - The resize is the purpose of the reload; noflush is only set to avoid losing > I/O if a path should fail. So any outstanding I/O may be expected to be > consistent with both the old and new sizes of the device. E.g. If it's > beyond the end of a shrinking device and userspace cared about not > losing that I/O, it would have waited for that I/O to be flushed > *before* issuing the resize. If the I/O is beyond the end of the > existing device but within the new size, userspace would have waited for > the resize operation to complete before allowing the new I/O to be > issued. If userspace cares about not losing I/O, it should wait for the I/O before trying to shrink the device size. After the shrinking started, any I/O beyond the new end of the device would have a possibility of lost. Reducing the size of the device while actively running I/O on it would anyway have a possibility of losing some of the I/O. > If the I/O is beyond the end of the > existing device but within the new size, userspace would have waited for > the resize operation to complete before allowing the new I/O to be > issued. Issuing the I/O beyond the end of the existing device would get error. If the issuer knows the device will be extended, it should wait for the completion of the extention. If it doesn't know, such I/O won't be issued. > => Is it OK for device-mapper to handle the device size check > internally, rejecting any I/O that falls beyond the end of the table (it I think this check is needed in current dm, regardless of noflush. > already must do this lookup anyway), and to update the size recorded in > the inode later, after I/O is flowing through the device again, but (of > course) before reporting that the resize operation is complete? > I.e. does it eliminate deadlocks if the bdget() and i_size_write() > happen after the 'resume'? There is no guarantee that the I/O flowing through the device again. The table might need be replaced again, but to do that, the resume should have been completed to let the userspace know it. bdget() in noflush suspend has a possibility of stall. Once it stalls, the remedy is userspace to replace the table with working one. However, since bdget() occurs inside of suspend_lock, it's not possible to run other instance of suspend/resume. OTOH, calling bdget() and i_size_write() outside of the lock can cause race with other table swapping and may result in setting wrong device size. Also, bdget() allocates new bdev inode if there isn't. But dm wants to access bdev inode only when it exists in memory. So something like bdlookup() will fit for the purpose, IMO. Thanks, -- Jun'ichi Nomura, NEC Corporation of America - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [dm-devel] [PATCH] dm: noflush resizing (0/3)
Hi Alasdair, Alasdair G Kergon wrote: Before reviewing the details of the proposed workaround, I'd like to see a deeper analysis of the problem to see that there isn't a cleaner way to resolve this. OK. Let me try. For example: Question) What are the realistic situations we must support that lead to a resize during table reload with I/O outstanding? 'noflush' is currently the only option for userspace to suspend without risking otherwise-avoidable I/O lost, because failure of underlying device might occur *after* the userspace started normal suspend but *before* the flushing completes. (normal suspend = flushing suspend) E.g. if userspace wants to resize dm-mpath device with queue_if_no_path, it has to use 'noflush' suspend on table swapping. Otherwise, path failures during the suspend will cause I/O error, though queue_if_no_path is specified. Other possible solution would be allowing the suspend to fail if it can't flush all I/O and letting userspace to retry after fixing the device failure. - The resize is the purpose of the reload; noflush is only set to avoid losing I/O if a path should fail. So any outstanding I/O may be expected to be consistent with both the old and new sizes of the device. E.g. If it's beyond the end of a shrinking device and userspace cared about not losing that I/O, it would have waited for that I/O to be flushed *before* issuing the resize. If the I/O is beyond the end of the existing device but within the new size, userspace would have waited for the resize operation to complete before allowing the new I/O to be issued. If userspace cares about not losing I/O, it should wait for the I/O before trying to shrink the device size. After the shrinking started, any I/O beyond the new end of the device would have a possibility of lost. Reducing the size of the device while actively running I/O on it would anyway have a possibility of losing some of the I/O. If the I/O is beyond the end of the existing device but within the new size, userspace would have waited for the resize operation to complete before allowing the new I/O to be issued. Issuing the I/O beyond the end of the existing device would get error. If the issuer knows the device will be extended, it should wait for the completion of the extention. If it doesn't know, such I/O won't be issued. = Is it OK for device-mapper to handle the device size check internally, rejecting any I/O that falls beyond the end of the table (it I think this check is needed in current dm, regardless of noflush. already must do this lookup anyway), and to update the size recorded in the inode later, after I/O is flowing through the device again, but (of course) before reporting that the resize operation is complete? I.e. does it eliminate deadlocks if the bdget() and i_size_write() happen after the 'resume'? There is no guarantee that the I/O flowing through the device again. The table might need be replaced again, but to do that, the resume should have been completed to let the userspace know it. bdget() in noflush suspend has a possibility of stall. Once it stalls, the remedy is userspace to replace the table with working one. However, since bdget() occurs inside of suspend_lock, it's not possible to run other instance of suspend/resume. OTOH, calling bdget() and i_size_write() outside of the lock can cause race with other table swapping and may result in setting wrong device size. Also, bdget() allocates new bdev inode if there isn't. But dm wants to access bdev inode only when it exists in memory. So something like bdlookup() will fit for the purpose, IMO. Thanks, -- Jun'ichi Nomura, NEC Corporation of America - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [dm-devel] [PATCH] dm: noflush resizing (0/3)
On Thu, Oct 25, 2007 at 10:18:02AM -0400, Jun'ichi Nomura wrote: There is no guarantee that the I/O flowing through the device again. The table might need be replaced again, but to do that, the resume should have been completed to let the userspace know it. Then the first attempt to set the size could be made to fail (because it could not get the lock immediately) and the size could be set after the second resume instead. - Setting the size would lag behind the actual size the dm table was supporting, but (given the usage cases discussed) this would not matter. bdget() in noflush suspend has a possibility of stall. So we cannot avoid fixing that: we require immediate return with failure instead of waiting. OTOH, calling bdget() and i_size_write() outside of the lock can cause race with other table swapping and may result in setting wrong device size. If the size setting is removed from the lock, then it becomes set the inode size to match the current size of the table and races would not matter - each set size attempt would set it to the instantaneous live table size, not a cached value that could be out-of-date. Alasdair -- [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [dm-devel] [PATCH] dm: noflush resizing (0/3)
Hi Alasdair, Thanks for the immediate reply. So as far as I understand, the point is: 1. it's preferable to resize inode after the resume, if possible 2. it's necessary to have nowait version of bdget() to avoid stall For the 1st item, I guess the reason is that we want to make the table swapping code deterministic. Is it correct? Even if we don't have to wait for bdget(), you like to have the resizing code after the resume? For the 2nd item, the patch included bdlookup() for that purpose. What do you think about it? I added below some additional comments and questions. Alasdair G Kergon wrote: On Thu, Oct 25, 2007 at 10:18:02AM -0400, Jun'ichi Nomura wrote: There is no guarantee that the I/O flowing through the device again. The table might need be replaced again, but to do that, the resume should have been completed to let the userspace know it. Then the first attempt to set the size could be made to fail (because it could not get the lock immediately) and the size could be set after the second resume instead. - Setting the size would lag behind the actual size the dm table was supporting, but (given the usage cases discussed) this would not matter. bdget() in noflush suspend has a possibility of stall. So we cannot avoid fixing that: we require immediate return with failure instead of waiting. bdlookup() is the almost nowait version of bdget() (almost means it may wait for I_NEW turned off in case there is bdget() doing its latter half. But it never stalls.) Do you think we need something else? OTOH, calling bdget() and i_size_write() outside of the lock can cause race with other table swapping and may result in setting wrong device size. If the size setting is removed from the lock, then it becomes set the inode size to match the current size of the table and races would not matter - each set size attempt would set it to the instantaneous live table size, not a cached value that could be out-of-date. Even though, I think we still want set_capacity() during the table swapping. So we need to call dm_table_get_size() twice. Is it ok? Also, do you want to move the resizing code for normal suspend, too? Otherwise, the code will be duplicated and I don't think you like it. Thanks, -- Jun'ichi Nomura, NEC Corporation of America - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [dm-devel] [PATCH] dm: noflush resizing (0/3)
Alasdair, Jun'ichi Nomura wrote: So as far as I understand, the point is: 1. it's preferable to resize inode after the resume, if possible Attached is a modified version of the (2/3) patch. - The logic is basically the same as before. - Moved the set-size function outside of the table swapping. - Moved the uevent to the end of resize from the end of resume. - Lightly tested on LVM2 mirror resizing. - (1/3) and (3/3) are unchanged. What do you think about this? --- drivers/md/dm-ioctl.c |4 ++ drivers/md/dm.c | 61 ++ include/linux/device-mapper.h |5 +++ 3 files changed, 46 insertions(+), 24 deletions(-) Index: linux-2.6.23.work/drivers/md/dm.c === --- linux-2.6.23.work.orig/drivers/md/dm.c +++ linux-2.6.23.work/drivers/md/dm.c @@ -1099,31 +1099,11 @@ static void event_callback(void *context wake_up(md-eventq); } -static void __set_size(struct mapped_device *md, sector_t size) -{ - set_capacity(md-disk, size); - - mutex_lock(md-suspended_bdev-bd_inode-i_mutex); - i_size_write(md-suspended_bdev-bd_inode, (loff_t)size SECTOR_SHIFT); - mutex_unlock(md-suspended_bdev-bd_inode-i_mutex); -} - static int __bind(struct mapped_device *md, struct dm_table *t) { struct request_queue *q = md-queue; - sector_t size; - - size = dm_table_get_size(t); - - /* -* Wipe any geometry if the size of the table changed. -*/ - if (size != get_capacity(md-disk)) - memset(md-geometry, 0, sizeof(md-geometry)); - if (md-suspended_bdev) - __set_size(md, size); - if (size == 0) + if (!dm_table_get_size(t)) return 0; dm_table_get(t); @@ -1497,8 +1477,6 @@ int dm_resume(struct mapped_device *md) dm_table_unplug_all(map); - kobject_uevent(md-disk-kobj, KOBJ_CHANGE); - r = 0; out: @@ -1508,6 +1486,43 @@ out: return r; } +void dm_set_size(struct mapped_device *md) +{ + struct dm_table *map; + struct block_device *bdev; + sector_t size; + + down(md-suspend_lock); + + map = dm_get_table(md); + if (!map) + goto out; + + size = dm_table_get_size(map); + + /* +* Wipe any geometry if the size of the table changed. +*/ + if (size != get_capacity(md-disk)) + memset(md-geometry, 0, sizeof(md-geometry)); + + set_capacity(md-disk, size); + + bdev = bdlookup_disk(md-disk, 0); + if (bdev) { + mutex_lock(bdev-bd_mutex); + i_size_write(bdev-bd_inode, (loff_t)size SECTOR_SHIFT); + mutex_unlock(bdev-bd_mutex); + bdput(bdev); + } + + kobject_uevent(md-disk-kobj, KOBJ_CHANGE); + +out: + dm_table_put(map); + up(md-suspend_lock); +} + /*- * Event notification. *---*/ Index: linux-2.6.23.work/drivers/md/dm-ioctl.c === --- linux-2.6.23.work.orig/drivers/md/dm-ioctl.c +++ linux-2.6.23.work/drivers/md/dm-ioctl.c @@ -840,8 +840,10 @@ static int do_resume(struct dm_ioctl *pa if (dm_suspended(md)) r = dm_resume(md); - if (!r) + if (!r) { + dm_set_size(md); r = __dev_status(md, param); + } dm_put(md); return r; Index: linux-2.6.23.work/include/linux/device-mapper.h === --- linux-2.6.23.work.orig/include/linux/device-mapper.h +++ linux-2.6.23.work/include/linux/device-mapper.h @@ -198,6 +198,11 @@ int dm_noflush_suspending(struct dm_targ int dm_get_geometry(struct mapped_device *md, struct hd_geometry *geo); int dm_set_geometry(struct mapped_device *md, struct hd_geometry *geo); +/* + * Update device size + */ +void dm_set_size(struct mapped_device *md); + /*- * Functions for manipulating device-mapper tables. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [dm-devel] [PATCH] dm: noflush resizing (0/3)
On Thu, Oct 25, 2007 at 02:46:17PM -0400, Jun'ichi Nomura wrote: So as far as I understand, the point is: 1. it's preferable to resize inode after the resume, if possible Not quite - I'm not expressing a preference yet. I'm saying the patches you presented were one option to resolve the problem, but it was not obvious to me they are the only option and I wanted to see a deeper analysis (keeping in mind the general direction I want the device-mapper code to be heading in). In other words, an analysis that draws out the conflict at the heart of this problem, and whether the problem we have with this locking is intrinsic to the current design or just a matter of implementation that can be fixed with some re-ordering or tweaks to the usage rules or types of the locks. Alasdair -- [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [dm-devel] [PATCH] dm: noflush resizing (0/3)
On Wed, Oct 24, 2007 at 05:25:17PM -0400, Jun'ichi Nomura wrote: > - For some device-mapper targets (multipath and mirror), > the mapping table sometimes has to be replaced to cope with device > failure. > OTOH, device-mapper flushes all pending I/Os upon table replacement > and may result in I/O errors, if there are device failures. > 'noflush' suspend is used to let dm queue the pending I/Os > instead of flushing them. > Since it's not possible for user space program to tell whether > the suspend could cause I/O error, they always use > 'noflush' to suspend mirror/multipath targets. > > - Currently resizing is disabled for 'noflush' suspend. > Resizing occurs in the course of table replacement. > To resize the device under use, device-mapper needs to get its > bdev inode. However, using bdget() in this case could cause deadlock > by waiting for I_LOCK where an I/O process holding I_LOCK is > waiting for completion of table replacement. Before reviewing the details of the proposed workaround, I'd like to see a deeper analysis of the problem to see that there isn't a cleaner way to resolve this. For example: Question) What are the realistic situations we must support that lead to a resize during table reload with I/O outstanding? - The resize is the purpose of the reload; noflush is only set to avoid losing I/O if a path should fail. So any outstanding I/O may be expected to be consistent with both the old and new sizes of the device. E.g. If it's beyond the end of a shrinking device and userspace cared about not losing that I/O, it would have waited for that I/O to be flushed *before* issuing the resize. If the I/O is beyond the end of the existing device but within the new size, userspace would have waited for the resize operation to complete before allowing the new I/O to be issued. => Is it OK for device-mapper to handle the device size check internally, rejecting any I/O that falls beyond the end of the table (it already must do this lookup anyway), and to update the size recorded in the inode later, after I/O is flowing through the device again, but (of course) before reporting that the resize operation is complete? I.e. does it eliminate deadlocks if the bdget() and i_size_write() happen after the 'resume'? Alasdair -- [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [dm-devel] [PATCH] dm: noflush resizing (0/3)
On Wed, Oct 24, 2007 at 05:25:17PM -0400, Jun'ichi Nomura wrote: - For some device-mapper targets (multipath and mirror), the mapping table sometimes has to be replaced to cope with device failure. OTOH, device-mapper flushes all pending I/Os upon table replacement and may result in I/O errors, if there are device failures. 'noflush' suspend is used to let dm queue the pending I/Os instead of flushing them. Since it's not possible for user space program to tell whether the suspend could cause I/O error, they always use 'noflush' to suspend mirror/multipath targets. - Currently resizing is disabled for 'noflush' suspend. Resizing occurs in the course of table replacement. To resize the device under use, device-mapper needs to get its bdev inode. However, using bdget() in this case could cause deadlock by waiting for I_LOCK where an I/O process holding I_LOCK is waiting for completion of table replacement. Before reviewing the details of the proposed workaround, I'd like to see a deeper analysis of the problem to see that there isn't a cleaner way to resolve this. For example: Question) What are the realistic situations we must support that lead to a resize during table reload with I/O outstanding? - The resize is the purpose of the reload; noflush is only set to avoid losing I/O if a path should fail. So any outstanding I/O may be expected to be consistent with both the old and new sizes of the device. E.g. If it's beyond the end of a shrinking device and userspace cared about not losing that I/O, it would have waited for that I/O to be flushed *before* issuing the resize. If the I/O is beyond the end of the existing device but within the new size, userspace would have waited for the resize operation to complete before allowing the new I/O to be issued. = Is it OK for device-mapper to handle the device size check internally, rejecting any I/O that falls beyond the end of the table (it already must do this lookup anyway), and to update the size recorded in the inode later, after I/O is flowing through the device again, but (of course) before reporting that the resize operation is complete? I.e. does it eliminate deadlocks if the bdget() and i_size_write() happen after the 'resume'? Alasdair -- [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/